DOCUMENT RESUME

ED 048 912                                        LI 002 721

TITLE          Automatic Dictionary Construction; Part II of
               Scientific Report No. ISR-18, Information Storage
               and Retrieval...
INSTITUTION    Cornell Univ., Ithaca, N.Y. Dept. of Computer
               Science.
SPONS AGENCY   National Library of Medicine (DHEW), Bethesda, Md.;
               National Science Foundation, Washington, D.C.
REPORT NO      ISR-18 [Part II]
PUB DATE       Oct 70
NOTE           124p.; Part of LI 002 719

EDRS PRICE     EDRS Price MF-$0.65 HC-$6.58
DESCRIPTORS    Automation, *Dictionaries, *Information Retrieval,
               Lexicography, *Lexicology, *Search Strategies,
               *Thesauri, Vocabulary, Word Lists
IDENTIFIERS    On Line Retrieval Systems, *Saltons Magical
               Automatic Retriever of Texts, SMART



ABSTRACT 

Part Two of the eighteenth report on Salton's
Magical Automatic Retriever of Texts (SMART) project is composed of
three papers: The first: "The Effect of Common Words and Synonyms on
Retrieval Performance" by D. Bergmark discloses that removal of
common words from the query and document vectors significantly
increases precision and that synonyms were more effective for recall
than common words. Paper two: "Negative Dictionaries" by K. Bonwit
and J. Aste-Tonsmann discusses a rationale for constructing negative
dictionaries and examines the retrieval results of experimentally
produced dictionaries. The third paper: "Experiments in Automatic
Thesaurus Construction for Information Retrieval" by G. Salton
describes several new methods for automatic, or semi-automatic,
dictionary construction, including procedures for the automatic
identification of common words, and novel automatic grouping methods.
The resulting dictionaries are evaluated in an information retrieval
environment. (For the entire SMART project report see LI 002 719, for
Part One see LI 002 720 and for Parts 3-5 see LI 002 722 through LI
002 724.) (NH)



PERMISSION TO REPRODUCE THIS COPYRIGHTED MATERIAL HAS BEEN GRANTED
TO ERIC AND ORGANIZATIONS OPERATING UNDER AGREEMENTS WITH THE U.S.
OFFICE OF EDUCATION. FURTHER REPRODUCTION OUTSIDE THE ERIC SYSTEM
REQUIRES PERMISSION OF THE COPYRIGHT OWNER.

Department of Computer Science 



Cornell University 



Ithaca, New York 14850 




Scientific Report No. ISR-18 
INFORMATION STORAGE AND RETRIEVAL 
to 

The National Science Foundation 
and to 

The National Library of Medicine



Reports on Analysis, Dictionary Construction, User
Feedback, Clustering, and On-Line Retrieval



Ithaca, New York



October 1970

U.S. DEPARTMENT OF HEALTH, EDUCATION & WELFARE
OFFICE OF EDUCATION
THIS DOCUMENT HAS BEEN REPRODUCED EXACTLY AS RECEIVED FROM THE
PERSON OR ORGANIZATION ORIGINATING IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT OFFICIAL OFFICE OF EDUCATION
POSITION OR POLICY.



Gerard Salton 
Project Director 






© 

Copyright, 1970 
by Cornell University 

Use, reproduction, or publication, in whole or in part, is permitted 
for any purpose of the United States Government. 








SMART Project Staff 



Robert Crawford 
Barbara Galaska 
Eileen Gudat 
Marcia Kerchner 
Ellen Lundell 
Robert Peck 
Jacob Razon 
Gerard Salton 
Donna Williamson 
Robert Williamson 
Steven Worona 
Joel Zumoff 










ERIC User Please Note:



This Table of Contents outlines all 5 parts of Information Storage 
and Retrieval (ISR-18), which is available in its entirety as 
LI 002 719. Only the papers from Part Two are reproduced here 
as LI 002 721. See LI 002 720 for Part One and LI 002 722 thru 
LI 002 724 for Parts 3-5. 



TABLE OF CONTENTS

                                                                 Page

SUMMARY ..... xv

PART ONE
AUTOMATIC CONTENT ANALYSIS

I. WEISS, S. F.
"Content Analysis in Information Retrieval"

Abstract ..... I-1
1. Introduction ..... I-2
2. ADI Experiments ..... I-5
   A) Statistical Phrases ..... I-5
   B) Syntactic Phrases ..... I-7
   C) Cooccurrence ..... I-9
   D) Elimination of Phrase List ..... I-12
   E) Analysis of ADI Results ..... I-20
3. The Cranfield Collection ..... I-26
4. The TIME Subset Collection ..... I-27
   A) Construction ..... I-27
   B) Analysis of Results ..... I-31
5. A Third Collection ..... I-39
6. Conclusion ..... I-43
References ..... I-46

II. SALTON, G.
"The 'Generality' Effect and the Retrieval Evaluation for Large
Collections"




Abstract ..... II-1
1. Introduction ..... II-1
2. Basic System Parameters ..... II-3
3. Variations in Collection Size ..... II-7
   A) Theoretical Considerations ..... II-7
   B) Evaluation Results ..... II-10
   C) Feedback Performance ..... II-15
4. Variations in Relevance Judgments ..... II-24
5. Summary ..... II-31
References ..... II-33

III. SALTON, G.
"Automatic Indexing Using Bibliographic Citations"

Abstract ..... III-1
1. Significance of Bibliographic Citations ..... III-1
2. The Citation Test ..... III-4
3. Evaluation Results ..... III-9
References ..... III-19
Appendix ..... III-20

IV. WEISS, S. F.
"Automatic Resolution of Ambiguities from Natural Language Text"



Abstract ..... IV-1
1. Introduction ..... IV-2
2. The Nature of Ambiguities ..... IV-4
3. Approaches to Disambiguation ..... IV-8
4. Automatic Disambiguation ..... IV-14
   A) Application of Extended Template Analysis to
      Disambiguation ..... IV-14
   B) The Disambiguation Process ..... IV-15
   C) Experiments ..... IV-17
   D) Further Disambiguation Processes ..... IV-20
5. Learning to Disambiguate Automatically ..... IV-21
   A) Introduction ..... IV-21
   B) Dictionary and Corpus ..... IV-21
   C) The Learning Process ..... IV-23
   D) Spurious Rules ..... IV-28
   E) Experiments and Results ..... IV-30
   F) Extensions ..... IV-46
6. Conclusion ..... IV-49
References ..... IV-50



PART TWO
AUTOMATIC DICTIONARY CONSTRUCTION

V. BERGMARK, D.
"The Effect of Common Words and Synonyms on Retrieval Performance"

Abstract ..... V-1
1. Introduction ..... V-1
2. Experiment Outline ..... V-2
   A) The Experimental Data Base ..... V-2
   B) Creation of the Significant Stem Dictionary ..... V-2
   C) Generation of New Query and Document Vectors ..... V-4
   D) Document Analysis - Search and Average Runs ..... V-5
3. Retrieval Performance Results ..... V-7
   A) Significant vs. Standard Stem Dictionary ..... V-7
   B) Significant Stem vs. Thesaurus ..... V-9
   C) Standard Stem vs. Thesaurus ..... V-11
   D) Recall Results ..... V-11
   E) Effect of "Query Wordiness" on Search Performance ..... V-15
   F) Effect of Query Length on Search Performance ..... V-15
   G) Effect of Query Generality on Search Performance ..... V-17
   H) Conclusions of the Global Analysis ..... V-19
4. Analysis of Search Performance ..... V-20
5. Conclusions ..... V-31
6. Further Studies ..... V-32
References ..... V-34
Appendix I ..... V-35
Appendix II ..... V-39




VI. BONWIT, K. and ASTE-TONSMANN, J.
"Negative Dictionaries"

Abstract ..... VI-1
1. Introduction ..... VI-1
2. Theory ..... VI-2
3. Experimental Results ..... VI-7
4. Experimental Method ..... VI-19
   A) Calculating Q ..... VI-19
   B) Deleting and Searching ..... VI-20
5. Cost Analysis ..... VI-25
6. Conclusions ..... VI-29
References ..... VI-33

VII. SALTON, G.
"Experiments in Automatic Thesaurus Construction for Information
Retrieval"

Abstract ..... VII-1
1. Manual Dictionary Construction ..... VII-1
2. Common Word Recognition ..... VII-8
3. Automatic Concept Grouping Procedures ..... VII-17
4. Summary ..... VII-25
References ..... VII-26




PART THREE
USER FEEDBACK PROCEDURES

VIII. BAKER, T. P.
"Variations on the Query Splitting Technique with Relevance
Feedback"

Abstract ..... VIII-1
1. Introduction ..... VIII-1
2. Algorithms for Query Splitting ..... VIII-3
3. Results of Experimental Runs ..... VIII-11
4. Evaluation ..... VIII-23
References ..... VIII-25

IX. CAPPS, B. and YIN, M.
"Effectiveness of Feedback Strategies on Collections of
Differing Generality"

Abstract ..... IX-1
1. Introduction ..... IX-1
2. Experimental Environment ..... IX-3
3. Experimental Results ..... IX-8
4. Conclusion ..... IX-19
References ..... IX-23
Appendix ..... IX-24



X. KERCHNER, M.
"Selective Negative Feedback Methods"

Abstract ..... X-1
1. Introduction ..... X-1
2. Methodology ..... X-2
3. Selective Negative Relevance Feedback Strategies ..... X-5
4. The Experimental Environment ..... X-6
5. Experimental Results ..... X-8
6. Evaluation of Experimental Results ..... X-13
References ..... X-30

XI. PAAVOLA, L.
"The Use of Past Relevance Decisions in Relevance Feedback"

Abstract ..... XI-1
1. Introduction ..... XI-1
2. Assumptions and Hypotheses ..... XI-2
3. Experimental Method ..... XI-3
4. Evaluation ..... XI-7
5. Conclusion ..... XI-12
References ..... XI-14




PART FOUR
CLUSTERING METHODS

XII. JOHNSON, D. B. and LAFUENTE, J. M.
"A Controlled Single Pass Classification Algorithm with Application
to Multilevel Clustering"

Abstract ..... XII-1
1. Introduction ..... XII-1
2. Methods of Clustering ..... XII-3
3. Strategy ..... XII-5
4. The Algorithm ..... XII-6
   A) Cluster Size ..... XII-8
   B) Number of Clusters ..... XII-9
   C) Overlap ..... XII-10
   D) An Example ..... XII-10
5. Implementation ..... XII-13
   A) Storage Management ..... XII-14
6. Results ..... XII-14
   A) Clustering Costs ..... XII-15
   B) Effect of Document Ordering ..... XII-19
   C) Search Results on Clustered ADI Collection ..... XII-20
   D) Search Results of Clustered Cranfield Collection ..... XII-31
7. Conclusions ..... XII-34
References ..... XII-37



XIII. WORONA, S.
"A Systematic Study of Query-Clustering Techniques: A
Progress Report"

Abstract ..... XIII-1
1. Introduction ..... XIII-1
2. The Experiment ..... XIII-4
   A) Splitting the Collection ..... XIII-4
   B) Phase 1: Clustering the Queries ..... XIII-6
   C) Phase 2: Clustering the Documents ..... XIII-8
   D) Phase 3: Assigning Centroids ..... XIII-12
   E) Summary ..... XIII-13
3. Results ..... XIII-13
4. Principles of Evaluation ..... XIII-16
References ..... XIII-22
Appendix A ..... XIII-24
Appendix B ..... XIII-29
Appendix C ..... XIII-36



PART FIVE
ON-LINE RETRIEVAL SYSTEM DESIGN

XIV. WILLIAMSON, D. and WILLIAMSON, R.
"A Prototype On-Line Document Retrieval System"

Abstract ..... XIV-1
1. Introduction ..... XIV-1
2. Anticipated Computer Configuration ..... XIV-2
3. On-Line Document Retrieval - A User's View ..... XIV-4
4. Console Driven Document Retrieval - An Internal View ..... XIV-10
   A) The Internal Structure ..... XIV-10
   B) General Characteristics of SMART Routines ..... XIV-16
   C) Pseudo-batching ..... XIV-17
   D) Attaching Consoles to SMART ..... XIV-19
   E) Console Handling - The Supervisor Interface ..... XIV-21
   F) Parameter Vectors ..... XIV-21
   G) The Flow of Control ..... XIV-22
   H) Timing Considerations ..... XIV-23
   I) Noncore Resident Files ..... XIV-26
   J) Core Resident Files ..... XIV-28
5. CONSOL - A Detailed Look ..... XIV-30
   A) Competition for Core ..... XIV-30
   B) The SMART On-line Console Control Block ..... XIV-31
   C) The READY Flag and the TRT Instruction ..... XIV-32
   D) The Routines LATCH, CONSIN, and CONSOT ..... XIV-32
   E) CONSOL as a Traffic Controller ..... XIV-34
   F) A Detailed View of CYCLE ..... XIV-37
6. Summary ..... XIV-39
Appendix ..... XIV-40



XV. WEISS, S. F.
"Template Analysis in a Conversational System"

Abstract ..... XV-1
1. Motivation ..... XV-1
2. Some Existing Conversational Systems ..... XV-4
3. Goals for a Proposed Conversational System ..... XV-7
4. Implementation of the Conversational System ..... XV-11
   A) Capabilities ..... XV-11
   B) Input Conventions ..... XV-12
   C) The Structure of the Process ..... XV-13
   D) Template Analysis in the Conversational System ..... XV-14
   E) The Guide Facility ..... XV-23
   F) Tutorials ..... XV-24
5. Experimentation ..... XV-25
   A) System Performance ..... XV-30
   B) User Performance ..... XV-31
   C) Timing ..... XV-34
6. Future Extensions ..... XV-35
7. Conclusion ..... XV-37
References ..... XV-39







ERIC User Please Note: 

This summary discusses all 5 parts of Information Storage 
and Retrieval (ISR-18), which is available in its entirety as 
LI 002 719. Only the papers from Part Two are reproduced here 
as LI 002 721. See LI 002 720 for Part One and LI 002 722 thru
LI 002 724 for Parts 3-5. 

Summary 

The present report is the eighteenth in a series describing research
in automatic information storage and retrieval conducted by the Department
of Computer Science at Cornell University. The report, covering work carried
out by the SMART project for approximately one year (summer 1969 to summer
1970), is separated into five parts: automatic content analysis (Sections
I to IV), automatic dictionary construction (Sections V to VII), user
feedback procedures (Sections VIII to XI), document and query clustering
methods (Sections XII and XIII), and SMART systems design for on-line
operations (Sections XIV and XV).

Most recipients of SMART project reports will experience a gap in
the series of scientific reports received to date. Report ISR-17, consisting
of a master's thesis by Thomas Brauen entitled "Document Vector Modification
in On-line Information Retrieval Systems", was prepared for limited
distribution during the fall of 1969. Report ISR-17 is available from the
National Technical Information Service in Springfield, Virginia 22151, under
order number PB 186-135.

The SMART system continues to operate in a batch processing mode
on the IBM 360 model 65 system at Cornell University. The standard processing
mode is eventually to be replaced by an on-line system using time-shared
console devices for input and output. The overall design for such an on-line
version of SMART has been completed, and is described in Section XIV of the
present report. While awaiting the time-sharing implementation of the
system, new retrieval experiments have been performed using larger document
collections within the existing system. Attempts to compare the performance
of several collections of different sizes must take into account the
collection "generality". A study of this problem is made in Section II of
the present report. Of special interest may also be the new procedures
for the automatic recognition of "common" words in English texts (Section
VI), and the automatic construction of thesauruses and dictionaries for use
in an automatic language analysis system (Section VII). Finally, a new
inexpensive method of document classification and term grouping is
described and evaluated in Section XII of the present report.

Sections I to IV cover experiments in automatic content analysis
and automatic indexing. Section I by S. F. Weiss contains the results of
experiments using statistical and syntactic procedures for the automatic
recognition of phrases in written texts. It is shown once again that,
because of the relative heterogeneity of most document collections and
the sparseness of the document space, phrases are not normally needed
for content identification.

In Section II by G. Salton, the "generality" problem is examined 
which arises when two or more distinct collections are compared in a 
retrieval environment. It is shown that proportionately fewer nonrelevant 
items tend to be retrieved when larger collections (of low generality) 
are used, than when small, high generality collections serve for evaluation 
purposes. The systems viewpoint thus normally favors the larger, low 
generality output, whereas the user viewpoint prefers the performance of 
the smaller collection.

The effectiveness of bibliographic citations for content analysis
purposes is examined in Section III by G. Salton. It is shown that in
some situations when the citation space is reasonably dense, the use of
citations attached to documents is even more effective than the use of
standard keywords or descriptors. In any case, citations should be added
to the normal descriptors whenever they happen to be available.

In the last section of Part 1, certain template analysis methods
are applied to the automatic resolution of ambiguous constructions
(Section IV by S. F. Weiss). It is shown that a set of contextual rules
can be constructed by a semi-automatic learning process, which will eventually
lead to an automatic recognition of over ninety percent of the existing
textual ambiguities.

Part 2, consisting of Sections V, VI and VII, covers procedures
for the automatic construction of dictionaries and thesauruses useful in
text analysis systems. In Section V by D. Bergmark it is shown that word
stem methods using large common word lists are more effective in an
information retrieval environment than some manually constructed thesauruses,
even though the latter also include synonym recognition facilities.

A new model for the automatic determination of "common" words
(which are not to be used for content identification) is proposed and
evaluated in Section VI by K. Bonwit and J. Aste-Tonsmann. The resulting
process can be incorporated into fully automatic dictionary construction
systems. The complete thesaurus construction problem is reviewed in Section
VII by G. Salton, and the effectiveness of a variety of automatic dictionaries
is evaluated.

Part 3, consisting of Sections VIII through XI, deals with a
number of refinements of the normal relevance feedback process which has
been examined in a number of previous reports in this series. In Section
VIII by T. P. Baker, a query splitting process is evaluated in which input
queries are split into two or more parts during feedback whenever the
relevant documents identified by the user are separated by one or more
nonrelevant ones.

The effectiveness of relevance feedback techniques in an environment
of variable generality is examined in Section IX by B. Capps and M. Yin.
It is shown that some of the feedback techniques are equally applicable
to collections of small and large generality. Techniques of negative
feedback (when no relevant items are identified by the users, but only
nonrelevant ones) are considered in Section X by M. Kerchner. It is shown
that a number of selective negative techniques, in which only certain
specific concepts are actually modified during the feedback process, bring
good improvements in retrieval effectiveness over the standard nonselective
methods.

Finally, a new feedback methodology in which a number of documents 
jointly identified as relevant to earlier queries are used as a set for 
relevance feedback purposes is proposed and evaluated in Section XI by L. 
Paavola. 

Two new clustering techniques are examined in Part 4 of this report,
consisting of Sections XII and XIII. A controlled, inexpensive, single-pass
clustering algorithm is described and evaluated in Section XII by D. B.
Johnson and J. M. Lafuente. In this clustering method, each document is
examined only once, and the procedure is shown to be equivalent in certain
circumstances to other more demanding clustering procedures.

The query clustering process, in which query groups are used to
define the information search strategy, is studied in Section XIII by S.
Worona. A variety of parameter values is evaluated in a retrieval
environment to be used for cluster generation, centroid definition, and final
search strategy.

The last part, number five, consisting of Sections XIV and XV, 
covers the design of on-line information retrieval systems. A new 
SMART system design for on-line use is proposed in Section XIV by D. and 
R. Williamson, based on the concepts of pseudo-batching and the interaction
of a cycling program with a console monitor. The user interface and 
conversational facilities are also described. 

A template analysis technique is used in Section XV by S. F. Weiss 
for the implementation of conversational retrieval systems used in a time- 
sharing environment. The effectiveness of the method is discussed, as 
well as its implementation in a retrieval situation. 

Additional automatic content analysis and search procedures used
with the SMART system are described in several previous reports in this
series, including notably reports ISR-11 to ISR-16 published between 1966
and 1969. These reports are all available from the National Technical
Information Service in Springfield, Virginia.



V. The Effect of Common Words and
Synonyms on Retrieval Performance

D. Bergmark 



Abstract 

The effect of removing common words from document and query vectors
is investigated, using the Cran-200 collection. The method used is
comparison of a standard stem dictionary and a thesaurus with a new dictionary
formed by adding an extensive common word list to the standard stem
dictionary. It is found that removal of common words from the query and
document vectors significantly increases precision. Query and document vectors
without either common words or synonyms yield the highest precision results
but inferior recall results. Synonyms are found to be more effective for
recall than common words.

1. Introduction 

A thesaurus results in about 10% better retrieval than a standard
stem dictionary, according to results in previous studies (2). This fact
leads to the question of why the thesaurus performs better: is it because
it groups terms into synonym classes, or is it because the thesaurus
includes a large common word list? If both contribute to the superiority of
the thesaurus, then it is desirable to determine what proportion of this
improvement is due to each factor. Taking common words out of a thesaurus
could consume little time compared to that required for grouping concepts
into synonym classes if an appropriate means of automatically generating
the common word list were found. Therefore, if a large amount of the
improvement of a thesaurus over the stem dictionary is due to removing common
words and putting them in a separate list, then it would be advantageous to
devote work to methods of isolating the insignificant words.

The subject of this paper, then, is a comparison of the search
results of a standard stem dictionary, a thesaurus, and a standard stem
dictionary with an extensive common word list. The results of this study
indicate that a large amount of the difference in retrieval performance
between thesaurus and standard stem dictionaries is due to the removal of
common words into a separate list. Surprisingly, the effects of synonyms and
of common words are similar; both encourage higher recall but both degrade
precision.

2. Experiment Outline 

A) The Experimental Data Base 

With limited resources, it is fairly important to choose carefully the
collection to be studied. First, the collection must be small enough to be
manageable within the resources available, yet large enough to give
significant results. The collection also has to have both a thesaurus and a
word stem dictionary available.

The Cran-200 collection seems to satisfy these criteria and is chosen
as the basis for the study. This collection has 200 documents and 42 queries,
and the text is available on tape for lookup with a new dictionary.

B) Creation of the Significant Stem Dictionary 

Investigating the retrieval effectiveness of an extensive common word
list together with a standard stem dictionary requires, perforce, the
generation of a new dictionary. Specifically, the new dictionary desired is
one which has the same stems as the standard stem dictionary but with many
more words marked as common.

The most readily available common word list for the Cran-200
collection is contained in the Cran-200 thesaurus. In fact, the thesaurus is
essentially the same dictionary as the standard stem dictionary except that
many more words are flagged as common, and synonyms are grouped into concept
classes by assignment of the same concept number to all word stems synonymous
with each other. Furthermore, since the same word may occur in more than one
concept class, one term may have more than one concept number assigned to it.

Thus more "significance" decisions are made in constructing a
thesaurus than in constructing a standard stem dictionary, both in removing
common and in removing infrequently used words from the dictionary list.
Hence if a thesaurus is turned back into a standard stem dictionary, the
result is a standard stem dictionary with a large common word list.
Therefore, rather than going through the standard stem dictionary and marking
additional words as common, the strategy followed in this experiment is to go
through the thesaurus and renumber the words so that the common words are
still flagged as common, but the stems are separated so that no two stems
have the same concept number and each stem has only one concept number.
This method is efficient since no word-matching need be done to determine
which are common words and which are not.

Punching the Cran-200 thesaurus, CRIMES, from Tape 9 onto cards
yields approximately 3380 cards with one thesaurus term per card along with
its concept class(es). These cards are then used as input to a 360/20 RPG
program which punches a duplicate deck in which each thesaurus term is
assigned a unique concept number, with numbering starting at 1 for the
significant terms and at 32001 for common terms. This results in
[illegible] significant, distinct words and 7[illegible]1 distinct common
words.
That the resulting dictionary (henceforth referred to as the
"significant stem dictionary") is the one desired can be seen from Appendix
I, which lists some typical query vectors using each of the three dictionaries.
It can be seen that the significant and standard stem queries are sufficiently
similar except for the inclusion of common words in the standard stem queries.*
The significant stem dictionary has approximately twice as many words marked
as common as does the standard stem dictionary. In addition, the significant
stem dictionary has about 65% as many significant concepts as the standard, and
many of the remainder are actually common and so were never included in, or were
deleted from, the thesaurus. The new dictionary thus has the same word
significance decisions (i.e., the same common word list) as the thesaurus, but
the same grouping decisions (i.e., none) as the word stem dictionary.
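
The renumbering step can be illustrated with a short sketch. It is written in Python for readability and is not the 360/20 program used in the study; the input format (one pair of term and common-word flag per card) and the name `renumber_thesaurus` are assumptions made for the example, as are the sample terms.

```python
def renumber_thesaurus(entries, first_significant=1, first_common=32001):
    """Give every thesaurus term its own concept number.

    entries: iterable of (term, is_common) pairs, one per card
             (hypothetical format). Synonym classes are discarded,
             so no two terms share a concept number.
    """
    significant, common = {}, {}
    for term, is_common in entries:
        table = common if is_common else significant
        base = first_common if is_common else first_significant
        if term not in table:          # a repeated term is numbered only once
            table[term] = base + len(table)
    return significant, common


# Invented sample cards, for illustration only.
cards = [("boundary", False), ("layer", False), ("the", True),
         ("layer", False), ("of", True)]
significant, common = renumber_thesaurus(cards)
print(significant)
print(common)
```

Because the common-word flag is carried over from the thesaurus, no word matching against a separate list is needed, which is the efficiency noted above.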

C) Generation of New Query and Document Vectors 

With the creation of the new dictionary, it is necessary to reassign
vectors for the queries and documents of the Cran-200 collection in preparation
for search runs. To accomplish this task the LOOKUP program, written in PL/I,
is used. This program reads in a dictionary, a suffix list, and the query or
document texts; it then generates concept vectors for the texts using the standard
suffixing rules. It is run once for the queries and once for the documents.
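
A minimal sketch of such a lookup pass is given below, in Python rather than PL/I. The "standard suffixing rules" are not spelled out in this paper, so the longest-suffix-first rule used here is an assumption, as are the function names `stem` and `lookup`, the sample dictionary, and the convention that concept numbers from 32001 upward mark common words (taken from the numbering described in the previous subsection).

```python
from collections import Counter


def stem(word, stems, suffixes):
    """Return the dictionary stem for word, or None if no stem matches."""
    if word in stems:
        return word
    # Assumed rule: try longer suffixes first.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and word[:-len(suffix)] in stems:
            return word[:-len(suffix)]
    return None


def lookup(text, dictionary, suffixes, common_base=32001):
    """Build a concept vector {concept number: weight} for one text."""
    vector = Counter()
    for raw in text.lower().split():
        found = stem(raw.strip(".,;:()"), dictionary, suffixes)
        if found is None:
            continue                      # word not in the dictionary
        concept = dictionary[found]
        if concept < common_base:         # common words are left out of the vector
            vector[concept] += 1
    return dict(vector)


# Invented dictionary and suffix list, for illustration only.
dictionary = {"flow": 1, "shock": 2, "wave": 3, "the": 32001, "of": 32002}
suffixes = ["s", "ing", "ed"]
print(lookup("The flows of shock waves", dictionary, suffixes))
```

Running the same function over the query texts and then over the document texts mirrors the two LOOKUP runs described above.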

Some decision has to be made concerning the suffix list; ideally it
should be as close as possible to that used for creating the original thesaurus
and standard stem vectors for the Cran-200 collection. The suffix list used in
this study contains approximately 19[illegible] terms, and the resulting vectors
indicate that it is quite similar to the one used to generate thesaurus and
standard stem vectors.



*There was some concern in the early stages of this work that the thesaurus
contains many full words rather than stems. Although there are full words in the
thesaurus which are only stems in the stem dictionary, the reverse is also true.
In any case, analysis of individual queries shows that these discrepancies have
no significant effect on what is retrieved.



As far as the Cran-200 text is concerned, it has to be picked out from
the Cran-1400 collection. A slight modification of the LOOKUP program does
this by allowing the user to specify which of the Cran-1400 query and
document texts are to be processed. One Cran-200 text (Text 995) is not on the
Cran-1400 tape but is fortunately not relevant to any of the Cran-200 queries;
it is not believed that the missing document perturbs results very much.

The average length of the resulting significant stem queries is 6.14
words as opposed to the standard stem queries with 8.26 words and the thesaurus
queries with 6.98 words. The size of the document vectors varies
proportionally with the length of the queries, except that the thesaurus
document vectors are in general slightly shorter than the significant stem
document vectors.

Why there are more words in the thesaurus queries than in the
significant stem queries is somewhat unclear. As can be seen from the queries
listed in Appendix I, the additional words in the thesaurus queries are common
ones; these words have been removed from the thesaurus, probably because they
were judged to be common, and thus do not appear in the significant stem queries.
On the other hand, some thesaurus queries have fewer significant terms than
the significant stem queries; this is because if two words are synonymous,
their concept number appears only once in the thesaurus query with a heavier
weight.
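
The last point can be made concrete with a small sketch. The stems and concept numbers below are invented for illustration and are not taken from the Cran-200 dictionaries; the function name `query_vector` is likewise hypothetical.

```python
from collections import Counter


def query_vector(stems, concept_of):
    """Map query stems to concept numbers, summing the weight per concept."""
    return dict(Counter(concept_of[s] for s in stems))


query = ["plate", "panel", "buckl"]

# Hypothetical thesaurus: "plate" and "panel" share one concept class.
thesaurus = {"plate": 10, "panel": 10, "buckl": 11}
# Hypothetical significant stem dictionary: one concept per stem.
significant_stem = {"plate": 1, "panel": 2, "buckl": 3}

print(query_vector(query, thesaurus))         # {10: 2, 11: 1}
print(query_vector(query, significant_stem))  # {1: 1, 2: 1, 3: 1}
```

The thesaurus vector has two entries, one of them with weight 2, while the significant stem vector keeps three entries of weight 1.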



D) Document Analysis — Search and Average Runs 

In order that the evaluation of all three dictionaries is on a
consistent basis, search runs must be done using vectors generated with all
three dictionaries. Relevancy judgments must be added to the significant stem
query vectors obtained by LOOKUP so that the same relevancy judgments are used
for each of the three sets of queries. A fairly simple search without complex
parameters is performed so that unnecessary complications in analysis do not
arise. A full search lists the top thirty documents, and then a positive
feedback search using the top five documents is done to make sure that removing
common words and synonyms does not have an unforeseen effect on feedback.
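
The two-stage run can be sketched as follows. The paper does not state the matching function or the feedback formula used, so this example assumes a cosine correlation and a simple additive update in which the vectors of relevant documents found among the top five are added to the query; the names `cosine`, `search` and `positive_feedback` and the toy collection are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine correlation between two sparse vectors {concept: weight}."""
    dot = sum(w * v.get(c, 0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, documents, top=30):
    """Rank documents by correlation with the query; drop zero correlations."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in documents.items()]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top]]


def positive_feedback(query, documents, ranked, relevant, top=5):
    """Add to the query the vectors of relevant documents in the first `top` ranks."""
    new_query = dict(query)
    for doc_id in ranked[:top]:
        if doc_id in relevant:
            for concept, weight in documents[doc_id].items():
                new_query[concept] = new_query.get(concept, 0) + weight
    return new_query


# Toy collection, for illustration only.
documents = {"d1": {1: 1, 2: 1}, "d2": {2: 1, 3: 1}, "d3": {4: 1}}
query = {1: 1}

first = search(query, documents)
expanded = positive_feedback(query, documents, first, relevant={"d1"})
second = search(expanded, documents)
print(first, expanded, second)
```

In the toy run the feedback query picks up concept 2 from the relevant document, which brings a second document into the ranked list.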

The results of the three searches, thesaurus, significant stem and
standard stem, are compared by analysis of overall measures as well as in-depth
analysis of individual queries to see to what extent not having synonyms hurts
or helps the retrieval process. Similarly, in-depth analysis is required to
see what effect common words, or lack of them, have on retrieval.

To aid the analysis, the standard averages are obtained as well as
the recall-level and document-level recall-precision graphs. The three full
searches are compared with each other, and the three feedback runs are compared
with each other. Results are verified using the standard significance tests.
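
For reference, the recall and precision values behind a document-level graph can be computed as in the following sketch. It uses the usual definitions (recall is the fraction of relevant documents retrieved, precision the fraction of retrieved documents that are relevant); the function name and the sample ranking are invented for illustration and do not reproduce the SMART averaging routines.

```python
def recall_precision_at(ranked, relevant, cutoff):
    """Recall and precision after the first `cutoff` retrieved documents."""
    hits = sum(1 for doc_id in ranked[:cutoff] if doc_id in relevant)
    return hits / len(relevant), hits / cutoff


# Invented ranking and relevance judgments, for illustration only.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}

for cutoff in (1, 2, 5):
    print(cutoff, recall_precision_at(ranked, relevant, cutoff))
```

Averaging these pairs over all queries at each cutoff gives one point of the document-level curve for a given dictionary.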

In addition, some statistics are calculated by hand to determine 
retrieval effectiveness. Specifically, it is felt that the default rank recall 
measure provided in the SMART averaging routines is not quite suited to the 
analysis being done here. When some of the relevant documents do not have any 
correlation with the query, the averages have to be based on extrapolation; in the 
standard SMART run, the rank recall is calculated assuming that the relevant 
documents with no correlation appear at the bottom of the list (i.e., rank 200, 199, 
198, ...). Since this project is directed toward seeing what effect common words 
have on precision as well as recall, it seems better to take into account the 
number of documents, relevant and non-relevant, which correlate with the query 
in the first place. That is, it seems that if one is testing precision, and 
if two queries each retrieve six out of nine relevant documents, but one of 
them recovers thirty more non-relevant documents than the other before going 
on to a zero correlation, it should be judged less precise than the other. 
Thus in the graphs derived by hand, rank recall is extrapolated on the basis 
of CORR.RANK+1, CORR.RANK+2, etc. for the relevant documents which have 
zero correlation with the query. 
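The two extrapolation rules can be sketched as follows. The sketch assumes the usual SMART definitions of rank recall (sum of the ideal ranks 1..n divided by the sum of the actual ranks of the n relevant documents) and log precision (the same ratio over the logarithms of the ranks); the function names and the two example queries are hypothetical.

```python
import math


def extrapolated_ranks(found_ranks, n_relevant, corr_rank=None, n_docs=200):
    """Return a rank for every relevant document.

    found_ranks: ranks of relevant documents with non-zero correlation.
    corr_rank:   rank of the last document with non-zero correlation
                 (CORR.RANK); if None, the default bottom-of-list rule
                 (ranks n_docs, n_docs - 1, ...) is used instead.
    """
    missing = n_relevant - len(found_ranks)
    if corr_rank is None:
        extra = [n_docs - i for i in range(missing)]
    else:
        extra = [corr_rank + i + 1 for i in range(missing)]
    return sorted(found_ranks) + sorted(extra)


def rank_recall(ranks):
    """Sum of ideal ranks 1..n divided by the sum of actual ranks."""
    n = len(ranks)
    return (n * (n + 1) / 2) / sum(ranks)


def log_precision(ranks):
    """Same ratio as rank recall, taken over the logarithms of the ranks."""
    n = len(ranks)
    ideal = sum(math.log(i) for i in range(1, n + 1))
    actual = sum(math.log(r) for r in ranks)
    return ideal / actual if actual else 1.0


# Two hypothetical queries: each finds six of nine relevant documents at the
# same ranks, but query B correlates with thirty more documents overall.
found = [1, 2, 3, 5, 8, 10]
query_a = extrapolated_ranks(found, 9, corr_rank=40)
query_b = extrapolated_ranks(found, 9, corr_rank=70)
default = extrapolated_ranks(found, 9)
print(rank_recall(query_a), rank_recall(query_b), rank_recall(default))
```

Under the default rule both queries receive the same rank recall, whereas the CORR.RANK rule scores query B lower than query A, which is the distinction argued for above. As a check on the assumed definition, ranks 1, 2, 7, 14 (the standard stem ranks listed for Query 36 in Figure 3) give a rank recall of 10/24, about 0.4167, in agreement with that figure.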

All in-depth analysis is performed on the full search results rather 
than on feedback results because the project is more concerned with 
determining the effect of dictionaries rather than the effect of feedback on 
retrieval. The recall-precision graphs for the three feedback runs are, 
however, included in Appendix II. 

3. Retrieval Performance Results 

A) Significant vs. Standard Stem Dictionary 

The results of this experiment show that, as expected, use of a 
large common word list improves the retrieval performance of a standard 
stem dictionary. It can be seen from Graphs 1 and 2, which show the recall 
and precision averages for two full searches, one using the standard stem 
dictionary and the other using the significant stem dictionary, that the 
significant stem dictionary results in greater precision at all recall and 
document levels. 

Furthermore, statistics for these runs bear out the same 
conclusion, that the significant stem performs better than the standard stem: 

                 Standard Stem    Significant Stem 
Rank Recall          .242              .3331 
Log Precision     [illegible]          .5053 

The above statistics are significant according to all the usual significance 
tests. 

It is interesting to note that the difference between the significant 
and standard stem curves remains fairly constant regardless of the recall 
or document level. This indicates that the significant stem performs roughly 
the same retrieval as the standard stem, only more precisely. In other 
words, including common terms in the document and query vectors seems to 
uniformly degrade precision performance. 

B) Significant Stem vs. Thesaurus 

It was originally expected that using a standard stem dictionary 
with a large common word list would result in search performance better 
than the standard stem but not as good as the thesaurus. From the 
recall-precision Graphs 3 and 4 it can be seen that, contrary to these expectations, 
the significant stem performs just as well as the thesaurus, if not better. 

The similarity of the significant stem and thesaurus curves is 
confirmed by global statistics which, while extremely close, give a slight 
edge to the significant stem dictionary: 

Significant Stem Thesaurus 
Rank Recall .3331 .3222 

Log Precision .5053 .4880 

Here the difference between the two curves is not constant. The 
significant stem performs better than the thesaurus at the low end of the 
curve, but loses this edge as recall increases. One may conclude that the 
significant stem queries find only the first few relevant documents faster than 
the thesaurus. 

C) Standard Stem vs. Thesaurus 

In general a thesaurus results in better retrieval performance than 
a standard stem dictionary, and this experiment shows roughly the same 
pattern. Recall-Precision Graphs 5 and 6 indicate the superiority of 
the thesaurus over the standard stem at all recall and document levels, 
with the superiority most marked at high recall levels. That the thesaurus, 
with its common word list and synonyms, is better than the standard stem 
but is approximately equal to the significant stem, with only a common word 
list, indicates that much of the improvement of the thesaurus over the 
standard stem is due to the common word list. Furthermore, comparison of 
these three sets of recall-precision plots seems to indicate that at the 
low recall end synonyms actually degrade precision, acting as common words do. 

D) Recall Results 

The difficulty with the significant stem dictionary, however, can 
be detected in the normalized global statistics (Figure 1). 

                  Standard Stem    Significant Stem    Thesaurus 
Norm Recall           .8469             .8330            .8732 
Norm Precision        .6615             .6918            .6924 

Normalized Recall and Precision for Full Search, All Dictionaries 
Figure 1 

These global statistics are much closer than the Rank Recall and 
Log Precision and indeed, the first favors the standard stem dictionary over 
the significant stem, although neither is significantly different according 
to the t-test. The problem displayed here is that the significant stem 
ultimately results in lower recall than does the standard stem; more 
queries have rank and precision measures based on extrapolation in the first 
case than in the second. 
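For reference, the normalized measures can be sketched as below, assuming the usual SMART definitions: normalized recall is one minus the excess of the actual rank sum over the ideal rank sum, divided by n(N - n); normalized precision is the same construction on the logarithms of the ranks, divided by ln(N! / ((N - n)! n!)). The function names are hypothetical and N = 200 is taken from the Cran-200 collection size.

```python
import math


def normalized_recall(ranks, n_docs=200):
    """1 - (sum of actual ranks - sum of ideal ranks) / (n * (N - n))."""
    n = len(ranks)
    ideal = n * (n + 1) / 2
    return 1.0 - (sum(ranks) - ideal) / (n * (n_docs - n))


def normalized_precision(ranks, n_docs=200):
    """Same construction on log ranks, scaled by ln(N! / ((N - n)! n!))."""
    n = len(ranks)
    ideal = sum(math.log(i) for i in range(1, n + 1))
    actual = sum(math.log(r) for r in ranks)
    # The binomial coefficient gives the largest possible gap between the sums.
    worst = math.log(math.comb(n_docs, n))
    return 1.0 - (actual - ideal) / worst


# Ranks of the four relevant documents for the thesaurus version of Query 36.
print(round(normalized_recall([1, 2, 5, 6]), 4))
```

With the thesaurus ranks listed for Query 36 in Figure 3 (1, 2, 5, 6), the normalized recall comes out at about 0.9949, the value given in that figure. Both measures equal 1.0 when every relevant document is ranked ahead of every non-relevant one.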

To be specific, 14 of the 42 queries using the significant stem 
dictionary do not have a 1.00 recall ceiling during the full search, while 
only nine of the standard stem and six of the thesaurus do not. The average 
recall ceiling for the significant stem is 0.8853 while the average recall 
ceiling for the standard stem is 0.9390 and 0.9565 for the thesaurus. After 
feedback, however, the difference narrows somewhat, going to 0.9504 for the 
significant stem dictionary and 0.9841 for the standard stem dictionary 
(the thesaurus at 0.9814 after feedback is not quite as good as the standard 
stem dictionary). 
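A recall ceiling of the kind quoted above can be sketched as the fraction of a query's relevant documents that have any non-zero correlation with it. The function names and the sample data are hypothetical.

```python
def recall_ceiling(correlations, relevant_ids):
    """Fraction of relevant documents with a non-zero query correlation.

    correlations: dict mapping document id -> correlation with the query;
                  documents absent from the dict count as zero correlation.
    """
    reachable = sum(1 for d in relevant_ids if correlations.get(d, 0.0) > 0.0)
    return reachable / len(relevant_ids)


def average_ceiling(per_query):
    """Mean recall ceiling over (correlations, relevant_ids) pairs."""
    ceilings = [recall_ceiling(c, r) for c, r in per_query]
    return sum(ceilings) / len(ceilings)


# Hypothetical data: the first query reaches one of its three relevant
# documents, the second reaches both of its relevant documents.
runs = [
    ({22: 0.30, 21: 0.0}, [21, 22, 23]),
    ({5: 0.41, 9: 0.12}, [5, 9]),
]
print(average_ceiling(runs))
```

A query contributes 1.00 to the average only when every relevant document correlates with it, so a dictionary that leaves some queries short of 1.00 lowers the average ceiling.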

It is reasonable that the recall ceiling is higher for the standard 
stem than for the significant stem, since the average query length for the 
former is greater than that for the latter. Thus chances for a significant 
stem query not correlating at all with documents relevant to it are greater 
than those for a standard stem query. Similarly, synonyms improve the chances 
of the thesaurus query's matching at least one relevant document. 

To measure this recall difference in another way, Figure 2 displays 
a recall measure used by Keen [2] based on the average rank of the last 
relevant document retrieved. Figure 2 is based on the full search results. 

Dictionary          Method 1    Method 2 
Standard Stem        83.33       60.29 
Significant Stem     87.64       ?6.45 
Thesaurus            73.24       57.57 

Method 1: Unrecovered relevant documents assigned ranks of 200, 199, 
          etc. 
Method 2: Unrecovered relevant documents assigned ranks of CORR.RANK+1, 
          CORR.RANK+2, etc., where CORR.RANK is the rank of the 
          document with the lowest correlation with the query greater 
          than 0. 

Average Rank of the Last Relevant Document 
Figure 2 

The method 1 averages, which measure ultimate recall ability, show 
that the thesaurus is superior in this respect, while the significant stem 
dictionary has the poorest recall. The method 2 averages, however, which 
are more a measure of precision in that they also include a measure of how 
many non-relevant documents are retrieved before correlation goes to zero, 
put the significant stem at the top of the list. Thus these averages 
reinforce the previous hypothesis that if the user wants to recover every 
last relevant document, he should use the thesaurus, and if instead he is 
interested in minimizing the number of non-relevant retrieved, he should 
use the significant stem dictionary. 
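The two ways of assigning a rank to the last relevant document (Method 1 and Method 2 of Figure 2) can be sketched as follows. The function names and the two sample queries are hypothetical; only the two assignment rules are taken from the figure.

```python
def last_relevant_rank(found_ranks, n_relevant, corr_rank, method, n_docs=200):
    """Rank of the last relevant document under Method 1 or Method 2."""
    missing = n_relevant - len(found_ranks)
    if missing == 0:
        return max(found_ranks)
    if method == 1:
        # Unrecovered documents take ranks n_docs, n_docs - 1, ...,
        # so the last relevant document sits at the bottom of the list.
        return n_docs
    # Method 2: unrecovered documents take CORR.RANK+1, CORR.RANK+2, ...
    return corr_rank + missing


def average_last_rank(queries, method):
    """Average rank of the last relevant document over a set of queries."""
    ranks = [last_relevant_rank(q["found"], q["n_relevant"], q["corr_rank"], method)
             for q in queries]
    return sum(ranks) / len(ranks)


# Hypothetical queries: the first recovers all three relevant documents,
# the second recovers two of four and has 35 documents with correlation > 0.
queries = [
    {"found": [1, 4, 9], "n_relevant": 3, "corr_rank": 120},
    {"found": [2, 6], "n_relevant": 4, "corr_rank": 35},
]
print(average_last_rank(queries, 1), average_last_rank(queries, 2))
```

Method 1 penalizes any unrecovered relevant document with the full collection size, so it tracks ultimate recall; Method 2 penalizes only in proportion to how many documents correlated at all, which is why it behaves more like a precision measure.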

E) Effect of "Query Wordiness" on Search Performance 

While it seems clear that the significant stem results in an overall 
increase in precision over standard stem queries, it seems likely that the 
"wordiness" of a query, or the number of common words included in the 
standard stem query not included in the significant stem query, should have 
some effect on retrieval. That is, the more wordy the standard stem query 
is, the more non-relevant documents should be retrieved before all the 
relevant ones. Graph 7 shows the rank recall averages for standard and 
significant stems, over all 42 queries, at various levels of "wordiness". 

It is not really clear that retrieval degrades faster as more 
and more common words are added to the query. A couple of possible 
explanations for this are 1) all the common words together retrieve the same 
documents, since the common words in a given query may be "related", or 
2) of the common words added, only one or two of them are responsible for 
retrieving garbage. (The latter theory seems to be confirmed by study of 
individual queries.) The left part of the graph is of course identical for 
both dictionaries since at that point the queries are practically identical. 




F) Effect of Query Length on Search Performance 

It also seems likely that the difference in performance would vary 
depending on the number of significant concepts in the query. For example, 
if the significant stem query is very explicit, with many significant 
concepts in it, then the added common words in the standard stem query should 
result in extremely precise retrieval. On the other hand, a very short query 
in terms of significant concepts would, one supposes, almost have to contain 
common words if any documents are to be retrieved at all. This hypothesis, 
however, is not borne out by the search results. Graph 8 plots rank recall 
for the significant and standard stem queries at various query lengths over 
42 queries. 

Length of Query vs. Rank Recall 
Graph 8 

Graph 8 indicates that there are indeed differences in the improvement 
of significant stem over standard stem queries, but there is no easy way to 
characterize the differences. There are other factors affecting retrieval, 
such as the number of documents relevant to the query. For example, with a 
very short query and few relevant documents, common words would be more 
necessary than if there are a lot of relevant documents. Thus the only fact 
shown by Graph 8 is that retrieval can vary with the length of the query; the 
best recall occurs at the average number of significant concepts, which is 
roughly six. 

G) Effect of Query Generality on Search Performance 

Remaining is the question of whether it is wise to forget about using 
a thesaurus with synonyms, since removing common words alone improves stem 
retrieval. Certainly the recall-precision graphs indicate that precision 
suffers with the thesaurus, particularly at low recall and document levels. 
In many cases, then, it appears that synonyms retrieve more non-relevant 
documents than a dictionary without synonyms. 

Graph 9, however, indicates that the picture for the thesaurus is 
not all that black. This graph shows, for all three dictionaries, rank recall 
plotted against the number of documents relevant to the query, holding query 
length constant; when query generality is low, the thesaurus performs best. 
Using a thesaurus improves the chances of those one or two relevant documents 
being retrieved, whereas the significant stem query may fail to correlate 
with any of the relevant documents. When there are many relevant documents, 
however, a thesaurus loses its edge because at least one of the relevant 
documents is likely to be retrieved by any of the queries, and the thesaurus 
synonyms serve only to retrieve a large number of non-relevant items. 

Rank Recall vs. # Documents Relevant 
(Queries with 6 Significant Concepts) 
Graph 9 

H) Conclusions of the Global Analysis 

The general conclusions which may be drawn from this global analysis 
are as follows: 

1) If one is interested in precision, it is definitely wise to 
   remove common words from the query and document vectors. 

2) If one is interested in a high recall ceiling during a full 
   search, one should use a thesaurus. The thesaurus has better 
   ultimate recall than does stem alone, indicating that synonyms 
   retrieve better than common words do. 

3) If there are few documents relevant to a query, one should use 
   a thesaurus. Keen reaches much the same conclusion, saying that 
   "for users needing high precision with only one or two relevant 
   documents, the thesaurus is little better than stem on IRE-3, 
   but in CRAN-1 and ADI, a larger superiority for the thesaurus 
   is evident." [2] (CRAN-1 is the same collection as is being used 
   here.) It is possible that while synonyms are useful in the 
   Cran-200 and ADI collections, in other collections synonyms 
   would not be required even for high recall. 

4) If there are many documents relevant to a query, it is just 
   as good and perhaps better to remove both common words and 
   synonyms from the query and document vectors. 

4. Analysis of Search Performance 

Having reached some conclusions on the basis of overall statistics, it 
is now appropriate to examine the reasons for these results by looking at some 
specific queries. 

The overall averages presented in Section 3 indicate the general 
superiority of the significant stem dictionary over the standard stem dictionary. At 
all recall (and document) levels, the significant stem has greater precision than 
does the standard stem. The reason for this improvement in performance can be 
seen by inspection of Query 36 (Figure 3). 



Relevant          Standard Stem    Significant Stem    Thesaurus 
Document #        Rank & Corr.     Rank & Corr.        Rank & Corr. 

37                 1  .4234            .5292            1  .4889 
35                 2  .2413            .3111            2  .3651 
36                 7  .1365            .2046            6  .2614 
34                14  .1064            .1519            5  .2505 

Rank Sum             .4167             .8333               .7143 
Log Precision        .4503             .8615               .7762 
Norm Recall          .?941             .9974               .9949 
Norm Precision       .7843             .9716               .9493 

Query 36 
Figure 3 



The standard stem query has two more terms in it than does the significant stem 
query, "determine" and "establish." It can be seen from Figure 3 that removal 
of these two common words from the query doubles search effectiveness. 

All three queries retrieve documents 35 and 37 first; the standard 
query, however, retrieves four non-relevant documents before the third relevant 
one. Two of these non-relevant documents are retrieved by the query word 
"determine" while the other two are retrieved simply because they are short and 
contain one query term each. 

Analysis of this query demonstrates two reasons why removing common 
words is beneficial to retrieval. One is that common words increase the 
chances of the query's correlating with a non-relevant document simply 
because that document and the query have the same common words in them. 
Secondly, inclusion of common words greatly increases the length of the 
document vectors, but short texts are lengthened relatively less than are 
long texts. Thus short texts have a decidedly greater chance of a high 
correlation with the query; having one term in common with the query gives a 
short text a disproportionately high correlation when relevancy should not be 
a function of text length. 
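The effect of shared common words on correlation can be illustrated with a small sketch. It assumes a cosine correlation between weighted term vectors; the vectors, the common word list and the function names are hypothetical and chosen only to make the effect visible.

```python
import math

COMMON_WORDS = {"determine", "establish"}  # hypothetical common word list


def cosine(query, doc):
    """Cosine correlation between two sparse term-weight vectors (dicts)."""
    dot = sum(w * doc.get(t, 0) for t, w in query.items())
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0


def drop_common(vector):
    """Remove every term that appears on the common word list."""
    return {t: w for t, w in vector.items() if t not in COMMON_WORDS}


query = {"determine": 1, "compressor": 1, "stall": 1}
short_nonrelevant = {"determine": 1, "blade": 1}
long_relevant = {"compressor": 1, "stall": 1, "axial": 2, "flow": 2,
                 "rotor": 2, "establish": 3}

# Standard stem vectors: common words are kept.
std_short = cosine(query, short_nonrelevant)
std_long = cosine(query, long_relevant)

# Significant stem vectors: common words are removed from both sides.
sig_short = cosine(drop_common(query), drop_common(short_nonrelevant))
sig_long = cosine(drop_common(query), drop_common(long_relevant))

print(std_short, std_long, sig_short, sig_long)
```

With common words kept, the short non-relevant text outranks the longer relevant one solely through the shared word "determine"; once common words are dropped its correlation falls to zero and the relevant text's correlation rises.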

Also indicated by the recall-precision curves is the similarity of 
the significant stem and thesaurus retrieval, with the significant stem being 
slightly better in general. This finding is also borne out by Query 36 
(Figure 3), where only two non-relevant documents are retrieved by the 
thesaurus query, as opposed to the one retrieved by the significant stem 
query, before a recall level of 1.00 is reached. Interestingly, the short 
document containing the terms "axial compressor" which was retrieved early 
by both the stem queries is not one of these two non-relevant documents 
retrieved early by the thesaurus query; rather, synonyms account for the 
retrieval of the two non-relevant items. Specifically, the query term 
"compressor" appears only once in the two non-relevant documents, while the 
synonym "impeller" appears seventeen times, giving them a high correlation 
with the thesaurus query. 

Query 36 thus demonstrates why synonyms can degrade precision; 
"compressor" is a frequently occurring word in the Cran-200 collection and 
in combination with its synonyms can cause retrieval of a number of 
non-relevant documents. Using stems alone, on the other hand, gives less 
emphasis to words like "compressor" and more to the group of significant 
query terms as a whole. 

Nevertheless, it is difficult to make hard and fast distinctions 
between the search precision of thesaurus queries versus significant stem 
queries. In Query 27 (Figure 4), for example, it is precisely the synonyms 
which account for the high performance of the thesaurus query. All three 
versions of Query 27 are identical, except that the thesaurus query, of 
course, includes synonyms. These synonyms serve to retrieve with relatively 
high precision the first three relevant documents. Specifically, document 
160 does not contain the term "boundary-layer" but it does contain its 
synonyms "boundary" and "layer" three times each. In this case, the low 
precision effect of synonyms is offset by the large set of query terms; 
taken as a whole, the complete set of query terms and their synonyms helps 
pinpoint the relevant documents more accurately. 

The superior correlation of relevant items 28 and 56 with the 
thesaurus query as opposed to the stem queries is explained by the shorter 
thesaurus document vector lengths (Figure 5). 



Document    Thesaurus Length    Significant Stem Length 
28                57                    63 
56                26                    27 

Length of Relevant Document Vectors for Query 27 
Figure 5 



Similarly, the significant stem is more precise than the standard stem 
because significant stem document vectors are shorter, giving higher weights 
to their significant terms. 

Search results in this study corroborate the findings of past 
workers that the thesaurus is better than the standard stem dictionaries. 
The results also indicate that much of this difference may well be 
attributable to the lengthy common word list of the thesaurus. In Query 36 (Figure 
3), for example, the improvement of the thesaurus query over the standard 
stem query is due more to the removal of common words than to synonyms. 

The same improvement can be seen in Query 7 (Figure 6) where the 
thesaurus results in much better retrieval than the standard stem query. 
All three queries retrieve the same two relevant and the same non-relevant 
documents in the first three recovered. After that, however, the next 
relevant document is found in ranks 11, 15 and 41 for the significant 
stem, thesaurus, and standard stem queries, respectively. This difference 
in retrieval is clearly due to the removal of common words, since the two 
dictionaries with the long common word list ranked about the same. Synonyms 
contribute very little to the high precision in the initial retrieval stages. 

Query 7 
Figure 6 

Results indicate, however, that at the higher recall levels, the 
thesaurus is superior. This is shown in Query 7 (Figure 6) where the last two 
relevant documents are retrieved much faster by the thesaurus query than by 
either of the two stem queries. The reason for this is primarily the shorter 
document lengths of the thesaurus vectors, and secondarily that the synonym 
"coefficient" is matched with the query term "derivative" in one case. 
(Shorter document length also explains the faster retrieval of 72 by the 
significant stem than by the standard stem. In the case of document 95, 
however, the standard dictionary works better than the significant stem 
dictionary because the common terms "comparison" and "number" combined with 
the significant "mach" boost the document-query correlation of 95.) 

That the significant stem dictionary has severe shortcomings in the 
lower correlation, high recall ranges is without doubt. This degradation in 
recall is not fully reflected by the recall-precision graphs, though it is 
seen in the normalized global statistics (Figure 1). 

The main explanation for this phenomenon appears to be that the 
significant stem vectors, with neither common words nor synonyms in them, 
have a good chance of "missing" a relevant document altogether. Query 23 
(Figure 7) demonstrates this in that one of the two relevant documents does 
not correlate at all with the significant stem query. 

Query 23 
Figure 7 

In this query, Item 148 has none of the significant query terms. It 
does, however, contain the synonyms "impeller" and "compressor" for the query 
term "pump," and it also contains "method," a common term found in the 
standard stem query. (It should be noted that Document 148 is picked up after 
feedback for the significant stem query.) 

While both common words and synonyms are useful for retrieval at 
high recall levels, synonyms are superior in this respect. In Query 3 
(Figure 8) the thesaurus is the only dictionary of the three which achieves 
100% recall during the full search. 

Query 3 
Figure 8 



The only reason that document 33 is retrieved by the thesaurus is 
that it contains the term "high-pressure-ratio" which matches "pressure" in 
the thesaurus query. Even the five extra terms added to the standard stem 
dictionary query fail to retrieve this last relevant item. 

It is interesting to note here that while recall is superior for the 
thesaurus in Query 3, precision is not. The synonyms, as noted above, retrieve 
many non-relevant documents, and here more so than even common words do. 

Once again, the rule that high recall means low precision seems to be borne out. 

Although the significant stem fails to achieve a 100% recall ceiling 
more often than both the other dictionaries, there are cases when high precision, 
low recall, and feedback can be effectively used to achieve high precision 
and high recall. One case of this is Query 1 (Figure 9) where so many 
non-relevant items are retrieved by the thesaurus and the standard stem that 
feedback is impossible because the user sees no relevant documents. Once again, as 
is typically the case, the thesaurus has the highest recall ceiling but not 
very precise retrieval. 

Query 1 
Figure 9 

The significant stem query retrieves only one of the three relevant items 
(22), but this item is used for positive feedback and in turn retrieves 
another relevant document (21). No feedback, on the other hand, can be done 
with the standard stem query (only 22 correlates, and it is in rank 29) or 
with the thesaurus query (two relevant documents correlate with the query, 
but are in ranks 32 and 33). Thus Query 1 demonstrates that it is not always 
necessary to have complete recall, at least during the initial search; high 
precision is more useful if feedback is going to be used. 
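Why early precision matters for feedback can be sketched as below. The sketch assumes a simple form of positive feedback in which the vectors of relevant documents found among the top five ranks are added to the query; the exact SMART feedback formula is not given here, and the function name, document vectors and filler document numbers are hypothetical. Only the rank positions of document 22 follow the Query 1 discussion above.

```python
def positive_feedback(query, ranked_docs, relevant_ids, doc_vectors, top_n=5):
    """Add the vectors of relevant documents seen in the top_n ranks.

    Returns the modified query and the list of documents used for feedback.
    """
    new_query = dict(query)
    used = [d for d in ranked_docs[:top_n] if d in relevant_ids]
    for d in used:
        for term, weight in doc_vectors[d].items():
            new_query[term] = new_query.get(term, 0) + weight
    return new_query, used


doc_vectors = {21: {"stall": 2, "rotor": 1}, 22: {"stall": 1, "compressor": 2}}
relevant = {21, 22, 23}
query = {"compressor": 1}

ranked_sig = [22, 90, 91, 92, 93]            # relevant item inside the top five
ranked_std = list(range(100, 128)) + [22]    # same item only at rank 29

new_sig, used_sig = positive_feedback(query, ranked_sig, relevant, doc_vectors)
new_std, used_std = positive_feedback(query, ranked_std, relevant, doc_vectors)
print(used_sig, new_sig)
print(used_std, new_std)
```

In the first case the query gains the term "stall", which it shares with the other relevant document (21); in the second case nothing relevant is visible in the top five, so the query is left unchanged and feedback cannot help.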

The feedback recall-precision graphs in Appendix II indicate that 
this is precisely what happens, since feedback improves the precision of 
the significant stem much more than the other two dictionaries at the high 
recall end of the curve. 

Query length, where length is the number of significant concepts in 
the query vectors, does not appear to affect precision in a consistent 
manner. If a query is worded very 
specifically, which dictionary is used is immaterial (see Query 12, Figure 10). 
On the other hand, a lengthy query may zero in faster on relevant documents 
but in the long run retrieves more non-relevant ones. 

Figure 10 

Figure 11 



It seems obvious, then, that an extensive common word list is 
helpful in retrieval, particularly if precision is desired. If one wishes 
to improve upon a standard stem dictionary, the first thing he should do 
is to find a good, extensive common word list. After that, additional 
improvement may be gained (in recall, particularly) by grouping some of the 
dictionary terms into concept classes. Doing it the other way around can 
be disastrous, however, as is seen in Query 19 (Figure 12). 

The significant stem dictionary here is clearly the best and the 
thesaurus is the worst. In Query 19, there are eight significant terms 
which in themselves result in good retrieval (as indicated by the 
performance of the significant stem query). In addition to these eight terms, 
there are five common terms in the standard stem query, causing it to 
retrieve five non-relevant items before the first relevant one. Figure 
13 shows how the significant terms can be overwhelmed by insignificant terms. 

Terms (and Number of Occurrences) Appearing in Top 6 
Documents Retrieved by Standard Stem Query 19 
Figure 13 



The thesaurus query vector for some reason contains three of the 
common terms added to Query 19; it does worse than the stem dictionaries 
because synonyms compound the difficulties of common words. The thesaurus 
query thus retrieves 14 non-relevant documents before finding the first 
relevant one. The query terms "oscillate" and "planform" both belong to 
relatively large synonym classes. 

5. Conclusions 

The main conclusion of this study in the area of dictionary 
construction is that careful construction of common word lists is at least as 
important as grouping concepts into synonym classes. This is an important 
result since it should be easier to construct common word lists automatically 
than to construct synonym classes automatically.* 

This study, in addition, has relevance to areas other than dictionary 
construction. For example, a fair amount of work is being done in the area 
of automatic document vector modification, which in part involves dropping 
"unimportant" concepts from the vectors (i.e., concepts infrequently used 
in queries). Since the common word list used in this study also contains 
infrequent words whereas the standard stem dictionary merely includes them 
as regular words, there is an opportunity in local analysis of these search 
runs to determine the effect of infrequently used words on retrieval. In 
particular, both Query 6 and Query 1 in some of their versions included an 
infrequent word not in the other versions. In neither case did this 
infrequent word affect retrieval except to lower correlations by lengthening the 
query vector. 

Another area in which this study is relevant is in scatter storage 
schemes for dictionary lookups [3]. This scheme can offer improvements in 
efficiency but thesaurus-type dictionaries are difficult to handle. One 
has to make a two-step mapping in order to get to the synonym class from 
the original query or document term; common words, on the other hand, can 
be handled easily enough. Therefore, having determined that a standard stem 
dictionary can be considerably improved by moving some words into the 
common word list, it would be better to implement this improvement in the 
scatter storage scheme than it would be to implement the improvement 
involving concept classes. 

* Work is being done in automatic synonym construction or has been done [1]. 
For these algorithms to work, however, common words probably have to be 
removed first, anyway. 
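The difference between the two kinds of lookup in a scatter storage (hash table) setting can be sketched as follows. Python's built-in set and dict are used here as stand-ins for the scatter storage scheme of [3]; the table contents and function names are hypothetical.

```python
COMMON = {"method", "determine"}                       # hypothetical entries
SYNONYM_CLASS = {"compressor": 101, "impeller": 101}   # stem -> class number
CLASS_MEMBERS = {101: ["compressor", "impeller"]}      # class number -> stems


def lookup_significant_stem(stem):
    """One hash probe: reject common words, otherwise keep the stem."""
    return None if stem in COMMON else stem


def lookup_thesaurus(stem):
    """Two steps: hash the stem to a class number, then fetch the class."""
    if stem in COMMON:
        return None
    class_no = SYNONYM_CLASS.get(stem)
    if class_no is None:
        return None
    return class_no, CLASS_MEMBERS[class_no]


print(lookup_significant_stem("method"), lookup_significant_stem("impeller"))
print(lookup_thesaurus("impeller"))
```

Handling common words needs only a membership test against one table, whereas reaching a synonym class needs a second table and a second mapping step, which is the extra difficulty noted above.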

Finally, this project carries out a suggestion made by Keen [2] 
that if the "five rules" of thesaurus construction are to be really evaluated, 
several different versions of a single dictionary would have to be made and 
tested. In the course of this study, a new dictionary is created, one 
which uses the frequency rules but not the grouping rules. Thus the 
importance of rules dealing with word frequency versus rules about synonym classes 
is established. It is just as important to be careful in constructing the 
common word list as in constructing the thesaurus. However, it is probably 
easier to follow the rules for common word list construction since common 
words are more systematic than synonyms are. 

6. Further Studies 

This investigation raises a few issues which were not settled, and 
which may prove interesting for further study: 

1) The work presented in this paper is of course not conclusive for 
collections other than the Cran-200. The first extension of this experiment, 
then, would be to perform a similar common word analysis on other collections. 
One reason for the apparent good performance of the significant stem dictionary 
is that the Cran-200 thesaurus is not that much better than the standard stem 
dictionary in the first place. 

2) The current Cran-200 collection still contains a fair number 
of common words in the thesaurus vectors although these same words have been 
marked common in the thesaurus itself. This could also explain the lack 
of performance of the thesaurus as compared with the significant stem 
dictionary. Thus a new look-up run should be made on the Cran-200 collection 
using the current version of the thesaurus to generate vectors without 
so many common words in them. 

3) It would be interesting to determine more precisely the influence 
of infrequent words on retrieval. 

4) More careful analysis of feedback results from this investigation 
should be made. 

Appendix I 

[Table columns: Query, Standard Stem, Significant Stem, Thesaurus] 

Run 0 - 42 Queries (Plus 0 Nulls) - Wordstem Feedback = Standard 
A Full Search with One Iteration of Feedback Using Word Stem Dictionary 

Rank        NR    CNR    NQ    Recall    Precision 
14           3    126    28    0.6622    0.4191 
15           3    129    28    0.6749    0.4150 
16           2    131    28    0.6805    0.409? 
17           3    134    28    0.6921    0.4069 
18           1    135    28    0.6947    0.40?5 
19           2    137    28    0.7054    0.3980 
20           2    139    28    0.7148    0.3948 
30          11    150    26    0.7612    0.3702 
50          19    169    20    0.8448    0.3531 
75          16    185     9    0.9321    0.3514 
100          2    187     8    0.9395    0.3491 
            11    198 

10.0%      139    139    28    0.7148    0.3948 
25.0%       30    169    20    0.8448    0.3531 
50.0%       18    187     8    0.9395    0.3491 
75.0%        6    193     3    0.9683    0.3484 
90.0%        1    194     2    0.9742    0.3483 
100.0%       4    198     0    1.0000    0.3486 

Keys:  NR  = Number of Relevant 
       CNR = Cumulative Number of Relevant 
       NQ  = Number of Queries used in the Average 
             not Dependent on any Extrapolation 
       %   = Percent of Total Number of Items in Collection 

[Graph: vertical axis Precision, 0.2 to 1.0; legend: ▲ Standard Stem, O Significant Stem, ■ Thesaurus] 



Run 1: 42 Queries (Plus 0 Nulls), Cranmine Feedl = Sig Stem
Full Search with One Iteration of Feedback using Stems with Common Words

RUN 1

Rank      NR    CNR    NQ    Recall    Precision
1         35     35    42    0.2405    0.8333
2         28     63    41    0.4146    0.7619
3         18     81    35    0.5011    0.7063
4         12     93    32    0.5479    0.6528
5          9    102    31    0.5848    0.6111
6          8    110    31    0.6170    0.5794
7          5    115    29    0.6393    0.5510
8          5    120    27    0.6594    0.5349
9          3    123    26    0.6772    0.5170
10         2    125    23    0.6868    0.5038
11         2    127    22    0.6941    0.4940
12         4    131    21    0.7128    0.4912
13         4    135    20    0.7273    0.4893
14         2    137    20    0.7329    0.4843
15         2    139    20    0.7448    0.4800
16         2    141    19    0.7525    0.4767
17         1    142    19    0.7555    0.4723
18         1    143    19    0.7603    0.4684
19         0    143    19    0.7603    0.4637
20         1    144    19    0.7642    0.4606
30        10    156    18    0.8064    0.4429
50        20    176    11    0.8685    0.4355
75         6    182     6    0.9216    0.4310
100        4    186     2    0.9397    0.4291
          12    198

10.0%    144    144    19    0.7642    0.4606
25.0%     32    176    11    0.8885    0.4355
50.0%     10    186     2    0.9397    0.4291
75.0%      2    188     0    0.9504    0.4275
90.0%      0    188     0    0.9504    0.4269
100.0%    10    198     0    1.0000    0.4278

Keys: NR  = Number of Relevant.
      CNR = Cumulative Number of Relevant.
      NQ  = Number of Queries used in the Average
            not Dependent on any Extrapolation.
      %   = Percent of Total Number of Items in Collection.

Document Level Averages (2)



Run 2: 42 Queries (Plus 0 Nulls), Thesaurus Feedback
A Full Search with One Iteration of Feedback

RUN 2

Rank      NR    CNR    NQ    Recall    Precision
1         31     31    42    0.2099    0.7381
2         24     55    41    0.3541    0.6667
3         10     65    36    0.3888    0.5714
4         15     80    36    0.4592    0.5536
5          6     86    34    0.4811    0.5060
6          4     90    34    0.5012    0.4663
7          8     98    34    0.5399    0.4515
8          9    107    33    0.5807    0.4452
9          6    113    29    0.6138    0.4389
10         2    115    28    0.6232    0.4254
11         6    121    27    0.6506    0.4239
12         3    124    25    0.6625    0.4186
13         4    128    25    0.6787    0.4160
14         1    129    25    0.6821    0.4087
15         2    131    24    0.6928    0.4047
16         1    132    24    0.6975    0.3998
17         3    135    24    0.7142    0.3982
18         2    137    23    0.7249    0.3958
19         2    139    23    0.7327    0.3936
20         3    142    23    0.7426    0.3929
30        15    157    22    0.7990    0.3777
50        18    175    15    0.8886    0.3662
75        10    185    10    0.9331    0.3616
100        0    185    10    0.9331    0.3583
          13    198

10.0%    142    142    23    0.7426    0.3929
25.0%     33    175    15    0.8886    0.3662
50.0%     10    185    10    0.9331    0.3583
75.0%      9    194     2    0.9774    0.3580
90.0%      0    194     1    0.9774    0.3576
100.0%     4    198     0    1.0000    0.3580

Keys: NR  = Number of Relevant.
      CNR = Cumulative Number of Relevant.
      NQ  = Number of Queries used in the Average
            not Dependent on any Extrapolation.
      %   = Percent of Total Number of Items in Collection.

Document Level Averages (3)



VI. Negative Dictionaries
K. Bonwit and J. Aste-Tonsmann



Abstract 

A rationale for constructing negative dictionaries is discussed. 
Experimental dictionaries are produced and retrieval results examined. 



1. Introduction 

Information retrieval often involves language processing, and
language processing frequently leads to language analysis. When the
information initially appears in natural language form, it is desirable to
perform some sort of normalization at the beginning of the analysis. A
system often used in practice assigns keywords, or index terms, to identify
the given information items. Dictionaries, listing permissible keywords
and their definitions, are employed in this process. Sometimes, a negative
dictionary is also used, to identify those terms which are not to be
assigned as keywords.

Various types of positive dictionaries, their construction and uses,
have been discussed elsewhere [1, 2, 3]. The question of the negative
dictionary, or what to leave out, is a fuzzy one. It is generally agreed
that "common function words", such as "and", "or", "but", which add to
the syntax but not the semantics of a sentence, should be dropped for the
purposes of information retrieval. Other words at the extreme ends of the
frequency distribution cause a problem. For example, "information" and
"retrieval" might appear in nearly every document of a collection on that
subject (high frequency); if included as keywords, they would retrieve
everything. Conversely, if only one document discusses "microfiches" (low
frequency), and that word does not constitute one of the permissible
keywords, that document may never be retrieved. As with most information
retrieval problems, the goals of the system, either high recall or high
precision, will determine how many words are to be included. In the
SMART system, a standard list of 204 "common English words" is used as a
negative dictionary for all collections.

The general procedure used for dictionary construction consists in
producing a concordance of the document collection with a frequency count,
and including in the negative dictionary rare, low frequency words, common
high frequency words, and words which appear in only nonsignificant contexts,
such as "observe" in "we observe that ..." This process requires the
choice of frequency cutoff points, and a definition of the notion of
"nonsignificance". It presumes a priori that such deletions will not affect
retrieval results too considerably. A preferable system would be one that
produces a negative dictionary of those terms which can be shown to detract
from retrieval efficiency, or at least, not to affect it.

2. Theory 

The set of keywords chosen for identifying documents constitutes the
index language. The number and type of words included will control the
specificity of the index language. Keen states [3] that

"a dictionary which provides optimum specificity for a given test
environment will exhibit a precision versus recall curve that is
superior to all others probably over the whole performance range."

The purpose of this report is to exhibit a means of measuring specificity,
and to show how a negative dictionary can be constructed to optimize index
language specificity.

The aim of a negative dictionary is to delete from the index
language all words which do not distinguish, and leave only those words
which discriminate, among the documents. If the documents are considered
as points in a vector space, with the associated identifying keywords as
coordinates, then documents containing many of the same keywords will be
relatively close together. If all keywords are permitted, then the
documents will all cluster in the subspace defined by the common words; on
the other hand, if only discriminators are permitted, the document space
will "spread out", since each discriminator separates the space into those
documents it identifies and those it does not.

The standard method for measuring "closeness", or correlation, of
two document vectors $v$ and $w$ is the cosine:

$$\cos(v,w) = \frac{\sum_i v_i w_i}{\sqrt{\sum_i v_i^2 \cdot \sum_i w_i^2}}$$

where $v_i$ ($w_i$) is the weight of the $i$th keyword in document $v$ ($w$),
and the sums run over all possible keywords.

The "compactness" ("closeness together") of the points in the
document space can be measured as follows:

1) find the centroid $c$ of all the document points, that is,

$$c_i = \frac{1}{N} \sum_{j=1}^{N} v_{ij}$$

where $v_{ij}$ is the weight of the $i$th keyword in document $j$, and
$N$ is the total number of documents;

2) find the correlation of each document with the centroid, i.e.,
$\cos(c, v_j)$, for all documents $j$;

3) define the document space similarity, $Q$, as:

$$Q = \sum_{j=1}^{N} \cos(c, v_j)$$

$Q$ has values between 0 and $N$, higher values indicating more similarity
among documents. The value 0 is never obtained since $c$ is a function of
the other vectors, and the value $N$ is obtained only if all the documents
are identical to the centroid. Normalized $Q$, i.e. $Q/N$, is just the
average document-centroid correlation (though this value is never
calculated in the work which follows).
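The three steps above can be sketched in a few lines of Python. This is only
an illustrative sketch, not the program used in the study: the function names
(`cosine`, `centroid`, `space_similarity`) and the toy weight vectors are
assumptions introduced here, and documents are taken to be equal-length lists
of keyword weights.

```python
import math

def cosine(v, w):
    """Cosine correlation of two equal-length weight vectors (0.0 if either is null)."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v) * sum(b * b for b in w))
    return dot / norm if norm > 0 else 0.0

def centroid(docs):
    """Step 1: c_i is the mean weight of keyword i over all N documents."""
    n = len(docs)
    return [sum(column) / n for column in zip(*docs)]

def space_similarity(docs):
    """Steps 2 and 3: Q is the sum of document-centroid cosines."""
    c = centroid(docs)
    return sum(cosine(c, v) for v in docs)

# Hypothetical toy collection: 3 documents over 4 keywords.
docs = [[1, 1, 0, 0],
        [1, 0, 1, 0],
        [1, 0, 0, 1]]
q = space_similarity(docs)
print(q, q / len(docs))  # Q and the normalized value Q/N
```

With identical documents every cosine equals 1, so the function returns N,
which matches the upper bound stated in the text.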

By calculating $Q$, using the terms provided by differing index
languages, it is possible to measure and compare the specificity of these
languages: a language is more specific the lower its $Q$. The question
remains how to discover the optimal $Q$ that will give the superior
recall-precision curve described by Keen.

To see what happens when a single keyword is deleted, let $Q_i$ be
defined as $Q$ calculated with the $i$th term deleted (i.e., $v_{ij}$ left out
of all calculations, for all documents $j$). Then, $|Q - Q_i|$ measures the
change in document space similarity due to the deletion of term $i$. If
$Q_i > Q$, the document space is more "bunched up", more similar, when term
$i$ is deleted; that is, term $i$ is a discriminator. Conversely, if
$Q_i < Q$, deletion of term $i$ causes the space to "spread out", to be more
dissimilar, and deletion of term $i$ may aid in retrieval. In the same way,
$Q_I$ is defined for a set of terms, $I = \{i_1, i_2, \ldots, i_n\}$. That
is, $Q_I$ measures the document space similarity when all the terms in set
$I$ have been deleted from the index language.

Since deletion of discriminators raises $Q$ and deletion of
non-discriminators lowers $Q$, some optimal set of terms $I_{min}$ should
exist such that $Q_{I_{min}}$ is minimal. It still remains to be shown that
the index language consisting of the set of keywords remaining when the set
$I_{min}$ is deleted from the total collection of keywords will be optimal
in the sense of Keen.

If the total set of keywords is $K = \{i_1, i_2, \ldots, i_t\}$, and
$I_{min} = \{i_1, \ldots, i_{min}\}$, $min \le t$, then Figure 1 describes
what should happen to $Q$ as terms are successively deleted from $K$ (a point
$(i_j, Q)$ represents $Q_{i_1, \ldots, i_j}$, i.e., $Q$ for the index language
given by $K - \{i_1, \ldots, i_j\}$). As non-discriminators are deleted, the
document space spreads out and $Q$ goes down to its minimum. Then as
discriminators are deleted, documents that were distinguished are coalesced,
the document space draws together, and $Q$ goes up (until all documents are
identically null).

It may be hypothesized that retrieval will follow the same pattern.
That is, using some method of retrieval evaluation, the best results will
occur at $Q_{I_{min}}$, and as $Q$ increases, retrieval "goodness" will
decrease. One measure of retrieval effectiveness is the rank of the last
relevant document retrieved. If $N_r$ is the average rank (over a group of
queries) of the last relevant document retrieved, then assuming retrieval
follows $Q$, $N_r$ versus $i$ will be as in Figure 2. As non-discriminators
are deleted ($i_1$ to $i_{min}$), it is easier to find the relevant
documents, and $N_r$ goes down until $i_{min}$ is reached. At that point
discriminators begin to be lost, the document space closes up, relevant
documents move closer to non-relevant, more non-relevant are retrieved along
with relevant, and $N_r$ goes back up.

3. Experimental Results

The ADI abstracts collection is used as a base for testing the above
predictions about the $Q$ and $N_r$ curves. The full (no common words deleted)
vectors and the accompanying word stem dictionary are used. The dictionary
terms are ranked twice:

a) in order of increasing $Q_i$, i.e., with the supposed discriminators
at the end of the list;

b) in order of decreasing frequency of occurrence (number of
documents appeared in), with the least frequent terms at the end.

Since the ADI collection contains 1218 keywords, only every 28th (an arbitrary
number) point of the curves is considered, i.e., what happens when terms
1-28, 1-56, 1-84, . . . are deleted (using the orderings above). At the
selected cutoffs, query searches are performed, and the corresponding $Q_I$'s
and $N_r$'s calculated.
r 

When the terms are deleted in increasing $Q_i$ order, the $Q$ and $N_r$
curves come out very much as predicted (Figures 3 and 4), both being of
approximately the same shape: dipping down to a minimum and shooting off
at both ends (see Figure 5 for comparison). Interestingly, no documents
are "lost" (have all their keywords deleted) until all but 98 keywords
are deleted, at which time $N_r$ shoots up, indicating that those 98 terms are
real discriminators. Also, the $N_r$ curve has a very large, flat middle
"minimum" (discounting noise) area: deleting 28 or 36 x 28 terms does not
make much difference.

The keywords are thus divided into 3 sets (Figure 4):

a) those on the right end whose deletion leads to better retrieval
(lower $N_r$);

b) the middle terms which do not make much difference;

c) those at the left end which must be retained for good retrieval.

[Figure: $N_r$ (total for 33 queries) vs. number of concepts, deletion by Q
order. Horizontal axis: i = number of terms, marked at 0, 98, 238, 378, 518,
658, 798, 938, 1078, 1218.]

[Figure: second plot vs. number of concepts, deletion by Q order. Horizontal
axis: number of terms.]

The sharp drop on the right-hand side of the curves is somewhat
misleading. If all the points along the drop were plotted (corresponding
to deleting 1, 2, 3, . . ., 28 keywords), it could be seen that the minimum
actually occurs after the first 10 terms are deleted. These 10 terms
constitute the set a), and it turns out that for all 10 terms, $Q_i < Q$
($Q$ without subscript is $Q$ for the full index language). That is, these
terms are of the type which according to predictions could be dropped from
the index language, and the $N_r$ curve shows that they should be. For all
other terms (sets b) and c)), $Q_i > Q$. The members of set a) are therefore
easy to identify and include in a negative dictionary: calculate $Q$ for the
full index language and $Q_i$ for each keyword, and put in the negative
dictionary those keywords with $Q_i < Q$.
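The selection rule just stated (keep in the negative dictionary every keyword
whose deletion lowers the document space similarity) can be sketched as
follows. This is a minimal, brute-force illustration under stated assumptions:
the helper names, the four-word vocabulary, and the toy weights are all
hypothetical, and each deletion is recomputed from scratch rather than with
the stored sums used by the actual programs.

```python
import math

def space_similarity(docs, deleted=frozenset()):
    """Q for the document space, ignoring the term indices in `deleted`."""
    keep = [i for i in range(len(docs[0])) if i not in deleted]
    n = len(docs)
    c = [sum(d[i] for d in docs) / n for i in keep]      # centroid on kept terms
    c_norm = math.sqrt(sum(x * x for x in c))
    q = 0.0
    for d in docs:
        v = [d[i] for i in keep]
        v_norm = math.sqrt(sum(x * x for x in v))
        if v_norm > 0 and c_norm > 0:                     # null vectors contribute 0
            q += sum(a * b for a, b in zip(v, c)) / (v_norm * c_norm)
    return q

def negative_dictionary(docs, terms):
    """Return the terms whose single deletion gives Q_i < Q."""
    q_full = space_similarity(docs)
    return [t for i, t in enumerate(terms)
            if space_similarity(docs, frozenset({i})) < q_full]

# Hypothetical toy data: "the" occurs everywhere, the other stems once each.
terms = ["the", "index", "tape", "library"]
docs = [[1, 1, 0, 0],
        [1, 0, 1, 0],
        [1, 0, 0, 1]]
print(negative_dictionary(docs, terms))
```

In this toy collection only the stem shared by every document lowers Q when
removed, so it is the only entry selected; no frequency cutoff is needed.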

The normalized recall, defined by

$$R_{norm} = 1 - \frac{\sum_{i=1}^{n} (r_i - i)}{n \cdot (N - n)}$$

for $N$ the total number of documents, $n$ the number of relevant documents
and $r_i$ the rank of the $i$th relevant document retrieved, is an alternate
measure of retrieval effectiveness. The curve of normalized recall vs. terms
deleted (Figure 6) delineates the same sets a), b), and c) that the $N_r$
curve did. Since high recall is an indication of good retrieval (as opposed to
low $N_r$), inverting the recall curve (by subtracting all values from 1) is
required to show that recall also follows the pattern of $Q$ (Figure 7).
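As a small worked illustration of the normalized recall measure, the sketch
below evaluates it for given ranks of the relevant documents. The function
name and the example ranks are assumptions made here; the collection size of
82 is the ADI document count quoted elsewhere in this report.

```python
def normalized_recall(relevant_ranks, n_docs):
    """1 - sum(r_i - i) / (n * (N - n)), with r_i the rank of the i-th relevant document.

    Assumes 0 < n < N, where n = len(relevant_ranks) and N = n_docs.
    """
    n = len(relevant_ranks)
    ranks = sorted(relevant_ranks)
    excess = sum(r - i for i, r in enumerate(ranks, start=1))
    return 1.0 - excess / (n * (n_docs - n))

# Hypothetical query with 3 relevant documents in an 82-document collection.
print(normalized_recall([1, 2, 3], 82))     # relevant documents ranked first
print(normalized_recall([80, 81, 82], 82))  # relevant documents ranked last
print(normalized_recall([2, 5, 22], 82))    # an intermediate case
```

The measure is 1 when all relevant documents head the ranking and 0 when they
all fall at the bottom, which is why the curve must be inverted before it can
be compared with Q or with the rank of the last relevant document.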

It is interesting to note the frequency classes into which the sets
a), b), and c) fall. The non-discriminating members of set a) exhibit the
highest frequencies (40% - 100%); the "in-between" members of set b)
have the lowest frequencies (0% - 10%), while the discriminators of set
c) have 10% - 40%. While the terms in each set occur in the above ranges,
within a set they are not exactly in frequency order. Therefore, in terms
of frequency, the dividing line between discriminators and non-discriminators
is not a clear one, and its absolute value (here, 40%) is likely to change
from collection to collection. The use of relative Q's to separate out
the non-discriminators, however, does not require the choice of such a
cutoff point, and is an easier criterion to apply in constructing a negative
dictionary.

When the terms are deleted in decreasing frequency order, the
predicted curves do not show up (Figures 8 and 9). $Q$ is strictly decreasing
(reading from the right): the more terms deleted, the more the space
spreads out. Since the terms are dropped in approximately the order a),
c), b), the loss of non-discriminator a) terms causes the same initial dip.
Since the c) terms occur in more documents (have higher frequencies) than
the b) terms, deleting them continues the process of spreading out the
document space, until documents are identified only by a stray, "rare" word
from set b). (In $Q$ order, deleting terms from set b) has the opposite effect:
documents that were "pulled away" from the centroid by odd words now move
in closer together as terms from set b) are deleted, and $Q$ goes up.)
$N_r$ has its initial dip resulting from the loss of the terms of set a), and
then rises sharply as the discriminating terms of set c) are lost and the
remaining keywords prove to be poor identifiers. In this case, documents
are "lost" much more quickly, after only 560 keywords are deleted.

[Figure 9: $N_r$ (total for 29 queries) vs. number of concepts, deleted by
frequency.]

It is interesting to look at the keywords that fall into sets a),
b), and c). Table 1 gives the 10 members of set a) in increasing $Q_i$ order
and their frequencies of occurrence (out of 82).

Keyword        Frequency
of             78
the            77
and            80
a              62
in             61
for            54
to             53
information    44
is             46
are            38

Table 1



Nine of the ten are identifiable as "common function words" without particular
semantic content. The tenth, the term "information", also shows up as a
non-discriminator, for this particular collection. Since the ADI collection 
covers documentation, this is not surprising. The fact that "information" 
does occur in set a) is an indication that the Q criterion will be helpful 
in constructing negative dictionaries tailored to the collection with which 
they will be used. 

When 40 x 28 terms are deleted, the 98 which remain comprise set c),
the so-called discriminators. Many of the 98 can be classified as "content
words": "request", "education", "thesaurus", "retrieve" (see Table 2). On
the other hand, several "function words" also occur, e.g., "at", "as", "it",
"not", "has", "was". That is, in the ADI collection composed of abstracts
(rather than full texts), these words serve to "distinguish" between those
"documents" in which they appear and those in which they do not. Again,
the Q criterion is matching the dictionary to the collection to produce
maximal retrieval in a mechanical way without the benefit of human judgment.

Keyword         Frequency   Keyword         Frequency   Keyword         Frequency
index              19       usage              12       tape                7
library            10       procedure           7       produce            11
science            12       national            6       role                8
exchange            3       chemical            5       manual              6
search             12       program            17       recognition         3
process            14       publication         5       editing             2
service            10       journal            10       new                11
documents          19       logic               4       been               13
center              7       reference           6       not                 4
definition          3       as                 23       rules               2
technical           9       mechanized          3       remote              1
computer           23       it                  9       interrogation       1
read                6       communication       7       microfilm           4
character           5       test                5       has                15
copy                7       can                11       prepare             5
be                 16       education           4       graduate            3
book                3       material            4       into                5
use                13       by                 27       an                 21
at                 18       concept             7       training            6
retrieve           28       need               11       that               11
analysis            7       level               3       abstract            5
file                6       organization        7       catalogue           1
date               14       facet               1       mathematical        1
thesaurus           4       vocabulary          4       access              5
system             33       have               10       store               7
from               17       or                 15       handle              8
method             13       which              14       school              4
page                5       citation            4       literature          5
transformation      2       comparison          4       word                5
machine            11       relation            5       was                 5
image               1       request             5       IBM                 4
text                7       foreign             1       name                2
automatic           8       special             8

Keywords are in decreasing $Q_i$ order, reading down the columns. That is,
"index" is the best discriminator, being better than "technical", which is
better than "usage", which is better than "tape", which is better than
"name", which is the worst discriminator in set c).

Set c): Discriminators

Table 2



The members of set b) appear in an average of two documents each.
Both "function words" like "would" and "content words" like "overdue" and
"efficiency" are found. Since function words are found in all three sets
(and therefore at all frequency ranges), it is clear that a criterion of
frequency of occurrence alone is not going to find all function words.
At the same time, it will not be a good judge of true discriminators.

4. Experimental Method

The above results are produced in a three-step process:

1) a LOOKUP run produces full document and query vectors,
and a list of all word stems used;

2) a FORTRAN program reads document-term vectors, calculates
$Q_i$ for each term $i$ and produces a file in increasing $Q_i$
order of keyword concept numbers, frequency of occurrence,
and their total sum of weights (over all documents). A
second program sorts this file into decreasing frequency
order;

3) a third program works with the full document and query
vectors, and either of the term-frequency-weight files to
perform the deletion of keywords and the search runs.

A) Calculating $Q_i$

The first program inverts the document-term vectors and works with
this new file and the term-frequency-weight file it creates. It finds the
elements of the centroid vector $c$ by dividing the total sums of weights for
each term by $N$, the number of documents. To calculate $Q$, it saves
$\sum_{i=1}^{t} v_{ij}^2$ for each document $j$, and $\sum_{i=1}^{t} c_i^2$
for the centroid. Then

$$Q = \sum_{j=1}^{N} \frac{\sum_{i=1}^{t} v_{ij} c_i}{\sqrt{\sum_{i=1}^{t} v_{ij}^2 \cdot \sum_{i=1}^{t} c_i^2}}$$

where $t$ is the total number of terms, and the values of $v_{ij}$ are obtained
from the term-document file. As the program goes along, it also saves
$\sum_{i=1}^{t} v_{ij} c_i$ for each document $j$. Then

$$Q_k = \sum_{j=1}^{N} \frac{\sum_{i=1}^{t} (v_{ij} c_i) - v_{kj} c_k}{\sqrt{\left(\sum_{i=1}^{t} v_{ij}^2 - v_{kj}^2\right) \left(\sum_{i=1}^{t} c_i^2 - c_k^2\right)}}$$

where the sums to $t$ are all stored values and the values involving $k$ are
in the program's files.
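The stored-sum bookkeeping described above can be sketched in Python as
follows. This is a hedged illustration rather than a transcription of the
FORTRAN program: the function name `all_q_k`, the in-memory list-of-lists
layout, and the toy data are assumptions, and a document left with no
keywords is taken to contribute nothing to the sum.

```python
import math

def all_q_k(docs):
    """Q_k for every term k, using sums computed once over the full vectors."""
    n, t = len(docs), len(docs[0])
    c = [sum(d[i] for d in docs) / n for i in range(t)]           # centroid
    c_sq = sum(x * x for x in c)                                  # sum of c_i^2
    v_sq = [sum(x * x for x in d) for d in docs]                  # sum of v_ij^2 per document
    dots = [sum(d[i] * c[i] for i in range(t)) for d in docs]     # sum of v_ij * c_i per document
    result = []
    for k in range(t):
        total = 0.0
        for j, d in enumerate(docs):
            num = dots[j] - d[k] * c[k]                           # drop term k from the numerator
            den = (v_sq[j] - d[k] ** 2) * (c_sq - c[k] ** 2)      # and from both norms
            if den > 1e-12:                                       # skip documents left null
                total += num / math.sqrt(den)
        result.append(total)
    return result

# Hypothetical toy collection: term 0 occurs in every document.
docs = [[1, 1, 0, 0],
        [1, 0, 1, 0],
        [1, 0, 0, 1]]
print(all_q_k(docs))
```

Each Q_k costs only one pass over the documents once the three sets of sums
are in hand, which is the point of saving them as the program goes along.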



B) Deleting and Searching

The third program also inverts the document-term file, and keeps
track of $\sum_i v_{ij}^2$ for all documents $j$, adjusting the values of the
sums as terms are deleted. This program calculates $Q_{1-28}$, $Q_{1-56}$,
. . . , in a manner similar to that described above.

To perform searching, a query $w$ and its relevancy decisions are read
in. Using pointers to keep track of which terms are deleted (which part of
the term-document file to ignore), the query is correlated with each
document in the collection of full vectors, then with document vectors with 28
terms deleted, then with 56 deleted, and so on. The cosine

$$\frac{\sum_i v_{ij} w_i}{\sqrt{\sum_i v_{ij}^2 \cdot \sum_i w_i^2}}$$

can be calculated, since the $\sum_i v_{ij}^2$ are stored, the $v_{ij}$ are
in the inverted term-document file, and $w$ was just read in. The ranks of
the relevant documents can be found by comparing cosines (number of
documents with a higher cosine = rank - 1). Typical results are shown in
Table 3. The output format is as follows:



the iteration number indicates how many groups of 28 keywords were
deleted;

C1 = average cosine of the relevant documents;

C2 = normalized recall;

$N_r$ = rank of last relevant document;

Q = $Q_I$ for the iteration given by the iteration number;

nR = document n is relevant; the next two numbers are its rank
and correlation with the query.
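The search step (correlate the query with every document while ignoring the
deleted terms, then rank by cosine) might look like the sketch below. The
function name, the toy vectors, and the choice to leave the query norm
untouched are assumptions made for illustration; since the query norm is the
same for every document, that choice does not change the ranking.

```python
import math

def rank_relevant(docs, query, relevant, deleted=frozenset()):
    """Rank of each relevant document: 1 + number of documents with a higher cosine."""
    keep = [i for i in range(len(query)) if i not in deleted]
    w_sq = sum(x * x for x in query)
    scores = []
    for d in docs:
        dot = sum(d[i] * query[i] for i in keep)       # deleted terms are ignored
        v_sq = sum(d[i] ** 2 for i in keep)
        scores.append(dot / math.sqrt(v_sq * w_sq) if v_sq > 0 and w_sq > 0 else 0.0)
    return {j: 1 + sum(1 for s in scores if s > scores[j]) for j in relevant}

# Hypothetical example: term 0 is a common word, document 1 is the relevant one.
docs = [[4, 1, 0, 0],
        [1, 0, 1, 3]]
query = [1, 0, 1, 0]
print(rank_relevant(docs, query, [1]))                          # full vectors
print(rank_relevant(docs, query, [1], deleted=frozenset({0})))  # common word deleted
```

The rank of the last relevant document for a query is then the largest value
in the returned dictionary. In this toy case deleting the common word moves
the relevant document from second place to first.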



The SMART routine AVERAGE is used to compare retrieval results for
different index languages. Some of the results for deleting terms in
increasing $Q_i$ order, in particular, iterations 0, 1, 9, 36, and 40, are
shown in Figure 10 (which labels these Run 0, 1, 2, 3, and 4, respectively).

The recall-precision curves show that deleting concepts does improve retrieval 
effectiveness. By comparing entries in the table of recall-precision values 
(Table 4), it can be seen that Run 1 falls on top of Run 2. That is, retrieval 
performance is about the same whether 28 or 9 x 28 keywords are deleted, but 
in either case, performance is better than when no terms are deleted. And 
when only 98 keywords are left (Run 4) , the performance is still better 
than with the full index language (Run 0), falling halfway between best 
and worst. 

Iteration   Query   C1      C2          NR   Q
5           24      0.196   0.9710807   22   23.997940
6           24      0.196   0.9710807   22   24.046610
7           24      0.198   0.9710807   22   24.113150
8           24      0.199   0.9710807   22   24.160200

To test the effectiveness of the negative dictionary created by the
Q criterion (i.e., the dictionary consists of the terms in set a)),
retrieval results should be compared with those obtained on the same
collection using the 204 "common English words" list as a negative dictionary.
The latter collection is not available on the SMART system, so results are
compared with those obtained using the thesaurus dictionary, which lumps
synonyms together as well as deleting the 204 words. As shown in Figure
11, the results with the Q negative dictionary (Run 1 = iteration 1) are
just about the same as those for the thesaurus, except in the low recall area.
Since thesaurus construction involves a large amount of hand work and human
judgment while the Q negative dictionary can be generated mechanically, the
Q method is preferable if high recall is desired, and the time and effort
saved by not preparing a thesaurus may justify the use of the Q method
even if precision is the goal.



5. Cost Analysis 

The basic rationale for negative dictionaries is that they delete many
of the frequent keywords, thus reducing the size of files, and lowering storage
and search costs. There is a tradeoff between file size and retrieval
effectiveness, and a point of balance between the two has to be found. From
Figure 10, it can be deduced that deleting 9 x 28 terms leads to about the same
retrieval results as deleting only 28 terms, and if any terms are dropped,
all 252 can be. However, deleting 36 x 28 (Run 3) lowers retrieval
performance only slightly. Is the saving worth deleting the extra terms?

The question can be rephrased as follows: what is the saving in
costs when extra terms from set b) are deleted? The keywords in set a) are
deleted to improve retrieval (Figure 10, Run 1). Deletion of keywords in
set b) has a lesser effect on retrieval (Runs 2 and 3), but the terms in
set b) constitute the bulk of the terms to be stored. How much do they cost
versus how much do they add to retrieval?

[Figure 11: Recall Level Averages.]



The cost accounting will depend on the system being used and the
kind of results it produces. Assume a print-out of all retrieved documents
is required and the system works as follows:

a) a full search is performed for each query, processed separately;

b) results are in the form "Document Title" and "Reference Number",
one line per document, with all documents retrieved printed out;

c) the computer is the 360/65 under CLASP;

d) the search program uses 250K and the file organization of the
SMART system.

Diagrammatically, the process will appear as in Figure 12. Queries are read
in, one at a time, and looked up in the dictionary (A). Each query is
correlated with all members of the document file (B) and ranked. The document
titles for all documents up to the last relevant are found in the title file
(C) and returned to the user (D). (Using all documents up to the last
relevant is a convenient measure of how many documents the average user will
see.)

[Figure 12: System Organization.]

What is the dependence of these operations on the total number of
terms $t$? Step (A) is independent of $t$: each word of the query must be
checked for occurrence in the dictionary; non-occurrence takes as long to
discover as occurrence. The search step (B) depends on $t$ in two ways: as
general file size is reduced, accessing time will go down, and as vector
length is reduced, the number of calculations required to compute
query-document correlations will be lower. Steps (C) and (D) are independent
of $t$, but are a function of $N_r$, the rank of the last relevant document
(since all documents with rank $\le N_r$ are printed, relevant or not).

Accessing time is related to the number of disc tracks read. The ADI
collection with all keywords included occupies 4 tracks. Deleting about
200 terms will reduce the number to 3, but even if all the terms found in
set b) are deleted, the number of tracks required remains at 3. For 35
queries, the total time saved with reduction to 3 tracks is 1.2 sec.
In addition, 50 millisec. is saved in computation time per term deleted, or
for 200 terms deleted, 10 more sec. saved.

The rank of the last relevant document, $N_r$, generally increases as
terms are deleted, resulting in more output lines and an increase in time
and cost. Table 5 gives exact figures, in terms of dollars saved, when
various numbers of terms are deleted. Figure 13 is a plot of these values,
showing the savings in search resulting from reduction from 4 tracks to 3,
and the total savings, as functions of the number of terms deleted.
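The way the total saving combines the search and print components can be
checked with a few lines of arithmetic. The sketch below is illustrative only:
the per-line print cost of $0.0013 is not stated in the text but is inferred
here from Table 5 (4 lines saved correspond to $0.0052), and the sample rows
are copied from that table.

```python
PRINT_COST_PER_LINE = 0.0013  # dollars; an inferred value, see the note above

def total_saved(search_saving, lines_saved):
    """Total dollars saved = search saving + print saving (negative if more lines print)."""
    return search_saving + lines_saved * PRINT_COST_PER_LINE

# (terms deleted from set b, search saving in dollars, decrease in N_r in lines)
rows = [(84, 0.00024, -2),
        (140, 0.00042, 4),
        (252, 0.0668, 11),
        (980, 0.0688, -61)]

for deleted, search, lines in rows:
    print(deleted, round(total_saved(search, lines), 5))
```

The search saving grows slowly once the file fits on 3 tracks, while the print
term turns negative as the rank of the last relevant document rises, so the
total eventually falls below zero when most of set b) is deleted.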



6. Conclusions 

Clearly, a negative dictionary is needed; deletion of some keywords 
definitely improves retrieval. Deleting words in order of increasing Q 
seems the better method? while the IC curve for frequency order has a lover 
minimum point, it is very unstable. Terms from set a), with $Q_i < Q$,
are to be deleted; discriminators from set c) are to be retained. The
question of what to do with the middle (set b)) depends on the needs of
the user. For a large collection, deleting all but the most vital terms
will save storage costs and search time, possibly at some small loss in
retrieval. The ADI collection is too small to show very significant
differences in cost when terms are deleted.
Number of    Number of       Save in      Decrease in N    Save in      Total
terms        terms deleted   Search       (lines saved)    Print        Saved
remaining    from set b)     (dollars)                     (dollars)    (dollars)

1190         0               0.0          0                0.0          0.0
1162         28              0.0          0                0.0          0.0
1134         56              0.00016      0                0.0          0.00016
1106         84              0.00024      -2               -0.0026      -0.00236
1078         112             0.00033      0                0.0          0.00033
1050         140             0.00042      4                0.0052       0.00562
1022         168             0.0005       3                0.0039       0.0044
994          196             0.0006       5                0.0065       0.0071
966          224             0.0667       11               0.0143       0.0810
938          252             0.0668       11               0.0143       0.0811
882          308             0.0670       1                0.0013       0.0683
826          364             0.0671       -1               -0.0013      0.0658
770          420             0.0672       -6               -0.0078      0.0594
714          476             0.0674       -13              -0.0169      0.0535
658          532             0.0676       -29              -0.0377      0.0299
546          644             0.0678       -29              -0.0377      0.0301
434          756             0.0682       -41              -0.0533      0.0149
322          868             0.0685       -47              -0.0611      0.0074
210          980             0.0688       -61              -0.0793      -0.0105

In terms of cost, the optimal number of terms to delete from set b) is
about 950.

Cost Statistics

Table 5
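
The arithmetic behind the table can be checked with a short sketch. It assumes that printing is charged at a flat rate per output line; the rate of 0.0013 dollars is not stated in the text but is the value implied by the row in which a single line is saved. The function name and the selection of rows are illustrative only.

```python
# Assumption: printing costs a flat 0.0013 dollars per output line,
# the value implied by the Table 5 row where one line is saved.
COST_PER_LINE = 0.0013


def total_saving(search_saving, lines_saved):
    """Return (print_saving, total_saving) in dollars for one table row."""
    print_saving = lines_saved * COST_PER_LINE
    return print_saving, search_saving + print_saving


# (terms deleted, search saving in dollars, decrease in N) from Table 5
rows = [(140, 0.00042, 4), (224, 0.0667, 11), (980, 0.0688, -61)]

for deleted, search, lines in rows:
    print_saving, total = total_saving(search, lines)
    print(f"{deleted:4d} terms deleted: print {print_saving:+.4f}, total {total:+.5f}")
```

A negative "decrease in N" means that more lines are printed, so the print saving becomes a cost and can outweigh the saving in search.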
The algorithm presented for determining the set a) requires the
calculation of $Q_i$ for each term i, and the storage of the entire
term-document file. By judicious handling of the values involved, a fairly
efficient method for discovering set a) is produced. This procedure should be
reasonably practical to run on a large collection, at least for generating
the initial negative dictionary. Updates for the dictionary when the
collection changes could be produced by rerunning the programs on a
representative sample of the revised collection.
VII. Experiments in Automatic Thesaurus Construction for
Information Retrieval

G. Salton



Abstract 



One of the principal intellectual as well as economic problems 
in automatic text analysis is the requirement for language analysis tools 
able to transform variable text inputs into standardized, analyzed 
formats. Normally, word lists and dictionaries are constructed manually 
at great expense in time and effort to be used in identifying
relationships between words and in distinguishing important "content" words
from "common" words to be discarded.

Several new methods for automatic, or semi-automatic, dictionary
construction are described, including procedures for the automatic
identification of common words, and novel automatic word grouping methods.
The resulting dictionaries are evaluated in an information retrieval
environment. It appears that in addition to the obvious economic advantages, 
several of the automatic analysis tools offer improvements in retrieval 
effectiveness over the standard, manual methods in general use. 

1. Manual Dictionary Construction 

Most information retrieval and text processing systems include as
a principal component a language analysis system designed to determine the
"content", or "meaning" of a given information item. In a conventional
library system, this analysis may be performed by a human agent, using
established classification schedules to determine what content identifiers
will best fit a given item. Other "automatic indexing" systems are known
in which the content identifiers are generated automatically from document
and query texts.

Since the natural language contains irregularities governing both
the syntactic and the semantic structures, a content analysis system must
normalize the input texts by transforming the variable, possibly ambiguous,
input structures into fixed, standardized content identifiers. Such a
language normalization process is often based on dictionaries and word lists,
which specify the allowable content identifiers, and give for each identifier
appropriate definitions to regularize and control its use. In the automatic
SMART document retrieval system, the following principal dictionary
types are used as an example [1]:

a) a negative dictionary containing "common" terms whose use
is proscribed for content analysis purposes;

b) a thesaurus, or synonym dictionary, specifying for each
dictionary entry, one or more synonym categories, or concept
classes;

c) a phrase dictionary identifying the most frequently used
word or concept combinations;

d) a hierarchical arrangement of terms or concepts, similar
in structure to a standard library classification schedule.

While well-constructed dictionaries are indispensable for a consistent
assignment of content identifiers, or concepts, to information items, the
task of building an effective dictionary is always difficult, particularly if
the environment within which the dictionary operates is subject to change,
or if the given subject area is relatively broad and nonhomogeneous. [2]

The following procedure summarizes the largely manual process normally
used by the SMART system for the construction of negative dictionaries and
thesauruses [3]:

a) a standard common word list is prepared consisting of
function words to be excluded from the dictionary;

b) a keyword-in-context, or concordance listing is generated
for a sample document collection in the area under
consideration, giving for each word token the context,
as well as the total occurrence frequency for each word;

c) the common word list is extended by adding new nonsignificant
words taken from the concordance listing;
in general, the words added to form the revised common
word list are either very high frequency words
providing little discrimination in the subject area under
consideration, or very low frequency words which produce
few matches between queries and documents;

d) a standard suffix list is prepared, consisting of the
principal suffixes applicable to English language
material;

e) an automatic suffix removal program is then used to reduce
all remaining (noncommon) words to word stem form; the
resulting word stem dictionary may be scanned (manually)
in order to detect inadequacies in the stemming procedure;

f) the most frequent significant word stems are then
selected to serve as "centers" of concept classes in the
thesaurus under construction;

g) the word stem dictionary is scanned in alphabetical order,
and medium-frequency word stems are either added to
existing concept classes, or are used as "centers" of
new concept classes;

h) the remaining, mostly low frequency, word stems are
inserted as members of existing word classes;

i) the final thesaurus is manually checked for internal
consistency, and printed out.

It has been found experimentally that thesauruses resulting from
these processing steps operate most satisfactorily if ambiguous terms are
entered only into those concept classes which are likely to be of interest
in the subject area under consideration; for example, a term like "bat"
need not be encoded to represent an animal if the document collection
deals with sports and ball games. Furthermore, the scope of the resulting
concept classes should be approximately comparable, in the sense that the
total frequency of occurrence of the words in a given concept class should
be about equal; high frequency terms must therefore remain in classes by
themselves, while low frequency terms should be grouped so that total
concept frequencies are equalized. [3] A typical thesaurus excerpt is shown
in Table 1 in alphabetical, as well as in numerical, order by concept
class number. (Class numbers above 32,000 designate "common" words.) [4]

A number of experiments have been carried out with the SMART system
in order to compare the effectiveness in a retrieval environment of manually
constructed thesauruses, providing synonym recognition, with that of simple
word stem matches in which word stems extracted from documents are matched
with those extracted from queries. In general, it is found that the
thesaurus procedure which assigns content identifiers representing concept
classes, rather than word stems, offers an improvement of about ten percent
in precision for a given recall level, when the retrieval results are
averaged over many search requests.
      Alphabetic Order                     Numeric Order

Word or          Concept             Concept      Words or
Word Stem        Classes             Class        Word Stems

wide             438                 344          obstacle
will             32032                            target
wind             345, 233            345          atmosphere
winding          233                              meteorolog
wipe             403                              weather
wire             232, 105                         wind
wire-wound       001                 346          aircraft
                                                  airplane
                                                  bomber
                                                  craft
                                                  helicopter
                                                  missile
                                                  plane

Typical Thesaurus Excerpt
Table 1







A typical recall-precision output is shown in Fig. 1 for thesaurus
and word stem analysis processes. For the left-hand graph (Fig. 1(a)) full
document texts were used in the analysis, whereas document abstracts were
used to produce Fig. 1(b).* [5]

In order to determine what thesaurus properties are particularly
desirable from a performance viewpoint, it is of interest to consider briefly
the main variables which control the thesaurus generation process [6]:

a) word stem generation 

i) type of suffixing procedure used — whether fully 

automatic or based on a pre-existing suffix dictionary; 

ii) extent of suffixing — whether based on individual 
word morphology alone, or also incorporating word 
context ; 

b) concept class generation 

i) degree of automation in deriving thesaurus classes; 

ii) average size of thesaurus classes; 

iii) homogeneity in size of thesaurus classes; 

iv) homogeneity in the frequency of occurrence of 

individual class members (within a thesaurus class); 

v) degree of overlap between thesaurus classes (that is,
number of word entries in common between classes);

vi) semantic closeness between thesaurus classes; 



* Recall is the proportion of relevant material actually retrieved, while
precision is the proportion of retrieved material actually relevant. In
general, one would like to retrieve much of what is relevant, while rejecting
much of what is extraneous, thereby producing high recall as well as high
precision. The curve closest to the upper right-hand corner of a typical
recall-precision graph represents the best performance, since recall as well
as precision is maximized at that point.
Comparison of Manual Thesaurus and Word Stem Processes
(Averages over 82 documents, 35 queries)
Fig. 1
c) "common" word recognition

i) degree of automation in common word recognition
process;

ii) proportion of common words as a percentage of the
entire dictionary;

d) processing of linguistic ambiguities

i) degree of automation in the recognition of
linguistic ambiguities;

ii) extent of recognition of ambiguous structures.



The language analysis procedures incorporated into the SMART
document retrieval system all use an automatic word suffixing routine
based on a hand-constructed suffix dictionary. Furthermore, linguistic
ambiguities represented, for example, by the occurrence of homographs
in texts are not explicitly recognized by the SMART analysis process.*
The two main variables to be considered in examining thesaurus
effectiveness are therefore the common word recognition and the concept
grouping procedures. These two problems are treated in the remainder of
this study.
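
Since recall and precision (as defined in the footnote to Fig. 1) underlie every evaluation reported in this study, a minimal sketch of their computation for a single query may be useful. The function name and the document numbers are illustrative assumptions; the SMART evaluation programs themselves are not described in the text.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    retrieved: iterable of document ids returned by the search
    relevant:  iterable of document ids judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical query: 4 documents retrieved, 3 of the 5 relevant ones found.
r, p = recall_precision(retrieved=[1, 2, 3, 9], relevant=[1, 2, 3, 4, 5])
print(r, p)  # 0.6 0.75
```

A recall-precision graph is obtained by repeating this computation at successive cutoff points in the ranked output and averaging over the queries.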

2. Common Word Recognition 

In discussing the common word problem, it is important, first of
all, to distinguish common function words, such as prepositions,
conjunctions, or articles, from common content words. The former are easily
identified by constructing a list of such terms which may remain constant
over many subject areas. The latter, typified by the word stem "automat" in a
collection of computer science documents, consist of very high (or very
low) frequency terms which should not be incorporated into the standard
concept classes of a thesaurus, because the respective terms do not
adequately discriminate among the documents in the subject area under
consideration. It is important that such words be recognized since their
assignment as content identifiers would produce high similarity coefficients
between information items which have little in common, and because their
presence would magnify the storage and processing costs for the analyzed
information items.

*Although several language analysis systems use elaborate procedures for
the recognition of linguistic ambiguities [7,8], it appears that most
potentially ambiguous structures are automatically resolved by restricting
the application of a given dictionary to a specific, well-defined subject
area.

To determine the importance of the common content word recognition,
a study was recently performed comparing the effectiveness in a retrieval
environment of a standard word-stem matching process, a standard thesaurus,
and a word-stem procedure in which the common content words normally
identified as part of the thesaurus process were also recognized. [9]
Specifically, a backward procedure was used to generate a word stem
dictionary from a thesaurus by breaking down individual thesaurus classes and
generating from each distinct word, or word stem, included in one of the
thesaurus classes, an entry in the new stem dictionary. The main difference
between this new significant stem dictionary and a standard stem dictionary
is the absence from the dictionary of word stems corresponding to common
function words and common content words normally identified only in a
thesaurus. A comparison between significant and standard stem dictionaries
will therefore produce evidence concerning the importance of common word
deletion from document and query identifications, while the comparison
between significant stem and thesaurus dictionaries leads to an evaluation of
the concept classes and the term grouping methods used to generate the
thesaurus.
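
The backward procedure can be sketched as follows, using the alphabetic half of the Table 1 excerpt as input. The sketch assumes that a stem is "common" when all of its class numbers lie above 32,000 (the convention stated for Table 1); the function name and the dictionary layout are illustrative and are not taken from the SMART programs.

```python
COMMON_CLASS_THRESHOLD = 32000  # class numbers above this mark "common" words


def significant_stems(thesaurus):
    """Break thesaurus classes down into a flat stem dictionary.

    thesaurus maps each word stem to the list of its concept class numbers.
    Stems whose classes are all "common" classes are left out.
    """
    return sorted(
        stem
        for stem, classes in thesaurus.items()
        if any(c <= COMMON_CLASS_THRESHOLD for c in classes)
    )


# Entries taken from the alphabetic half of the Table 1 excerpt.
excerpt = {
    "wide": [438],
    "will": [32032],
    "wind": [345, 233],
    "winding": [233],
    "wipe": [403],
    "wire": [232, 105],
    "wire-wound": [1],
}

print(significant_stems(excerpt))
```

For this excerpt only "will" is dropped; every other stem becomes an entry of its own, with the class groupings discarded.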

A recall-precision graph for the performance of the three dictionary
types is shown in Fig. 2(a), averaged over forty-two queries and
two hundred documents in aerodynamics. It may be seen from Fig. 2(a)
that the thesaurus produces an improvement of some ten percent in
precision for a given recall value over the standard stem process.
Unexpectedly, a further improvement is obtained for the significant stem
dictionary over the thesaurus performance, indicating that the main virtue
of the aerodynamics dictionary being tested is the identification of common
words, rather than the grouping of terms into concept classes. For the
collection under study, the significant stem dictionary contains about
twice as many common word entries as the standard stem dictionary.

Obviously, the recall-precision results reflected in the graph
of Fig. 2(a) cannot be used to conclude that synonym dictionaries, or
thesauruses based on term grouping procedures are useless for the
analysis of document and query content in information retrieval. Quite
often, special requirements may exist for individual queries, such as,
for example, an expressed need for very high recall, or precision; in
such circumstances, a thesaurus may indeed turn out to be essential.

Consider as an example, the output graph of Fig. 2(b) in which
a global evaluation measure, known as rank recall, is plotted for the
ten queries (out of forty-two) which were identified by exactly six
thesaurus concepts.* It is seen that for queries with very few relevant
documents in the collection, the thesaurus in fact is able to identify the
relevant items more effectively than either of the stem dictionaries. As
the number of relevant documents per query increases, the stem methods catch
up with the thesaurus process.

In view of the obvious importance of common word identification, one
may inquire whether such entries might not be identifiable automatically,
instead of being manually generated by the procedure outlined in the previous
section. This question was studied using the following mathematical model.
Consider the original set of terms, or concepts, used to identify a given
query and document collection, and let this term set be altered by selective
deletion of certain terms from the query and document identifications. One
of two results will then be obtained depending on the type of terms actually
removed:

a) if the terms to be removed are useful for content analysis
purposes, they will provide discrimination among the documents,
and their removal will cause the document space to become more
"bunched-up" by rendering all documents more similar to each
other, that is, by increasing the correlation between pairs of
documents;

b) on the other hand, if the terms being removed are common words
which do not provide discrimination, the document space will
spread out, and the correlation between document pairs will
decrease.

This situation is illustrated by the simplified model of Fig. 3,
where each document is identified by an "x", and the similarity between two
documents is assumed inversely proportional to the distance between
corresponding x's. The conjecture to be tested is then the following: a term
to be identified as a "common" word, and therefore to be removed from
the set of potential content identifiers (and from the set of allowable
thesaurus concepts) is one which causes the document space to spread
out by decreasing its compactness.

b) Document Space After Removal of Useful Discriminators

c) Document Space After Removal of Useless Nondiscriminators

Changes in Document Space Compactness Following Deletion of Certain Terms
Fig. 3

The following procedure is used to verify the conjecture [10].
Consider a set of $N$ documents, and let each document $j$ be represented
by a vector of terms, or concepts, $v_j$, where $v_{ij}$ represents the weight
of term $i$ in document $j$. Let the centroid $c$ of all document points in
a collection be defined as the "mean document", that is

$$c_i = \frac{1}{N} \sum_{j=1}^{N} v_{ij};$$

the centroid is then effectively the center of gravity of the document
space. If the similarity between pairs of documents $i$ and $j$ is given
by the correlation $r(v_i, v_j)$, where $r$ ranges from 1 for perfectly
similar items to 0 for completely disjoint pairs, the compactness $Q$ of the
document space may be defined as

$$Q = \sum_{j=1}^{N} r(c, v_j), \qquad 0 \le Q \le N,$$

that is, as the sum of the similarities between each document and the
centroid; greater values of $Q$ indicate greater compactness of the
document space.

Consider then the function $Q_i$ defining the compactness of the
document space with term $i$ deleted. If $Q_i > Q$, the document space is more
compact and term $i$ is a discriminator; contrariwise, if $Q_i < Q$, the space
is more spread out, and deletion of term $i$ may produce better retrieval.
Since deletion of discriminators raises $Q$, and deletion of nondiscriminators
(common words) lowers $Q$, an optimal set $I$ of terms must exist such that
$Q_I$ becomes minimal.

The following experimental procedure may then be used:

a) consider each term $i$ in order and compute $Q_i$;

b) arrange the terms in order of decreasing $Q_i$ (that is,
with terms causing the greatest decrease coming first);

c) define the set $I$ of common terms to be deleted as the set
leading to a minimal $Q$.
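
A minimal sketch of this procedure is given below. Several points are assumptions rather than details from the text: the cosine coefficient is used for the correlation r (the text requires only that r range from 0 to 1), the terms are ranked so that those whose deletion lowers the compactness most come first, and the common set is approximated by all terms with $Q_i < Q$ instead of searching for the exact minimizing set. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine correlation between two weight vectors (0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compactness(docs):
    """Q = sum over documents of r(c, v_j), with c the mean document."""
    n_terms = len(docs[0])
    centroid = [sum(d[i] for d in docs) / len(docs) for i in range(n_terms)]
    return sum(cosine(centroid, d) for d in docs)


def rank_terms(docs):
    """Return (Q, [(term index, Q_i), ...]) with the lowest Q_i first."""
    q = compactness(docs)
    q_i = []
    for i in range(len(docs[0])):
        reduced = [d[:i] + d[i + 1:] for d in docs]  # delete term i everywhere
        q_i.append((i, compactness(reduced)))
    q_i.sort(key=lambda pair: pair[1])
    return q, q_i


# Toy collection: term 0 occurs in every document, terms 1-3 in one each.
docs = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1]]
q, ranking = rank_terms(docs)
common = [i for i, qi in ranking if qi < q]  # deletion spreads the space out
print(round(q, 3), common)
```

In the toy collection the term shared by all documents is the only one whose deletion lowers Q, so it alone is flagged as common. Recomputing Q once per term, as done here, costs on the order of (terms x documents x terms) operations and is meant only to make the definition concrete.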

Fig. 4 shows the evaluation results obtained by using this process
with a collection of eighty-two documents in the field of documentation,
together with thirty-five user queries. A total of 1218 distinct word stems
were initially available for the identification of documents. It is seen
from Fig. 4(a) that the evaluation results verify the model completely:

a) as high frequency nondiscriminators are first deleted,
the space spreads out, and the corresponding recall-precision
output (following deletion of 252 terms) is
improved by about twenty percent;

b) when additional terms are deleted, the compactness of
the space begins to increase as discriminators are
removed, and the recall-precision performance deteriorates;
the middle curve of Fig. 4(a) represents the
performance following deletion of 1120 terms (in
decreasing Q order), at which time the retrieval
effectiveness has already diminished by about ten percent.
A comparison between the standard thesaurus performance and a word
stem method with the top twenty-eight common terms deleted is shown in Fig.
4(b). It is seen that the thesaurus process is somewhat superior only
at the low recall end with the two graphs being nearly equivalent over
most of the performance region.

The results of Fig. 4 thus confirm the earlier studies of Fig. 2 
in the sense that word stem matching methods produce performance parameters 
nearly equivalent to those obtainable by standard thesauruses, providing 
only that common word stems are appropriately identified, and removed as 
potential content identifiers. 

3. Automatic Concept Grouping Procedures 

For many years, the general classification problem consisting of 
the generation of groups, or classes, of items which are similar, in some 
sense, to each other has been of major concern in many fields of scientific 
endeavor. In information retrieval, documents are often classified by 
grouping them into clusters of items thereby simplifying the information 
search process. Alternatively, terms, or concepts, are grouped into
thesaurus classes in such a way that synonyms and other related terms are
all identifiable by the same thesaurus class numbers.

In section 1 of this report, various criteria were specified for
the manual, or intellectual construction of thesaurus classes. Since the
manual generation of thesauruses requires, however, a great deal of time
and experience, experiments have been conducted for some years leading
to an automatic determination of thesaurus classes based on the properties
of the available document collections, that is, on the assignment of
terms to documents. The general process may be described as follows [11]:

a) a term-document matrix is first constructed specifying the
assignment of terms to documents, including term weights, if any;

b) a term-term similarity matrix is generated from the
term-document matrix by computing the similarity between each pair
of term vectors, based on joint assignment of terms to documents;

c) a threshold value is applied to the term-term similarity
matrix to produce a binary term-term connection matrix in
which two terms are assumed connected (that is, a 1 appears
in the connection matrix) whenever the similarity between
corresponding term vectors is sufficiently high;

d) the binary connection matrix may be viewed as an abstract
graph in which each term is represented by a node, and each
existing connection as a branch between corresponding pairs
of nodes; some function of this graph (for example, the
connected components, or maximal complete subgraphs of
the graph) is then used to define the clusters, or classes
of terms.*
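
Steps a) through d) can be sketched as follows, under two assumptions that the text leaves open: the cosine coefficient serves as the term-term similarity, and connected components are chosen as the graph function defining the classes. The matrix, the threshold value and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two term vectors (0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def term_classes(term_doc, threshold):
    """Cluster terms as connected components of the thresholded similarity graph.

    term_doc[t] is the vector of weights of term t over all documents (step a).
    """
    n = len(term_doc)
    # Steps b) and c): similarity matrix, then binary connection matrix.
    connected = [
        [i != j and cosine(term_doc[i], term_doc[j]) >= threshold for j in range(n)]
        for i in range(n)
    ]
    # Step d): connected components found by depth-first search.
    classes, seen = [], set()
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for other in range(n):
                if connected[node][other] and other not in seen:
                    seen.add(other)
                    stack.append(other)
        classes.append(sorted(component))
    return classes


# Hypothetical 4-term, 4-document matrix: terms 0 and 1 co-occur, as do 2 and 3.
term_doc = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
print(term_classes(term_doc, threshold=0.6))
```

The full connection matrix requires a similarity computation for every pair of terms, which is the source of the expense noted in the text for large vocabularies.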

A number of investigators have constructed term classifications
automatically, using procedures similar to the ones outlined above [12, 13,
14]. Unfortunately, the generation of the term-term connection matrix is
time-consuming and expensive when the number of terms is not very small.
For this reason, less expensive automatic classification methods, in which
an existing rough classification is improved by selective modification of
the original classes, tend to be used in practice. [15, 16]

*A connected component of a graph is a subgraph for which each pair of
nodes is connected by a path (a chain of branches); in a maximal complete
subgraph, each pair of nodes is connected by a direct branch, and no node
not in the subgraph will exhibit such a connection to all other nodes of
the subgraph.



To determine the effectiveness of such automatically constructed
term classifications in a retrieval environment, three types of experiments
are briefly described involving, respectively, an automatic refinement
of already existing classes; two fully automatic term classification
methods; and a semi-automatic classification process.



The first of these experiments consists in taking an existing term
classification, or an existing thesaurus, and in refining the term classes
by removing classes which are highly overlapping. [17] One such algorithm
tried with the SMART system was based on the following steps (in addition
to steps a) through d) already listed):



e) given the existing term classes, a class-class similarity
matrix is constructed, using the procedures already outlined
for the term-term matrix;

f) a threshold value is applied to the class-class matrix
to produce a binary class connection matrix;

g) each maximal complete subgraph defines a new merged
concept class;

h) merged classes that are subsets of other larger
classes are removed, the remainder constituting
the new merged classification.
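
Steps g) and h) can be sketched as follows, taking the binary class connection matrix of step f) as given. The enumeration of maximal complete subgraphs uses the basic Bron-Kerbosch recursion, which is a choice made for this sketch and not necessarily the method used with SMART; the example classes reuse words from Table 1 but their grouping and connections are hypothetical.

```python
def maximal_cliques(adj):
    """Bron-Kerbosch enumeration of maximal complete subgraphs.

    adj maps each node to the set of nodes it is directly connected to.
    """
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)
            return
        for node in list(candidates):
            expand(current | {node}, candidates & adj[node], excluded & adj[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adj), set())
    return cliques


def merge_classes(classes, adj):
    """Steps g) and h): merge connected classes, then drop subsumed results."""
    merged = {
        frozenset().union(*(classes[c] for c in clique))
        for clique in maximal_cliques(adj)
    }
    kept = [sorted(m) for m in merged if not any(m < other for other in merged)]
    return sorted(kept)


# Hypothetical input: four existing classes and their connection matrix,
# given as adjacency sets (the output of steps e and f).
classes = {
    0: {"aircraft", "airplane"},
    1: {"airplane", "plane"},
    2: {"plane"},
    3: {"weather", "wind"},
}
adj = {0: {1}, 1: {0}, 2: set(), 3: set()}

print(merge_classes(classes, adj))
```

Classes 0 and 1 are merged into one class; the unconnected class 2 is then a subset of the merged class and is removed in step h), while class 3 survives unchanged.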

This procedure was used to refine the documentation thesaurus
originally available for the ADI collection, consisting of eighty-two
documents and thirty-five search requests. Two "merged" thesauruses
were produced as follows:

a) thesaurus 1 with a total of 156 concept classes and approximately
3.9 concepts per class;

b) thesaurus 2 with a total of 239 concept classes, averaging
1.4 concepts per class. [18]

The global normalized recall and precision values, averaged over the
thirty-five queries and exhibited in Table 2, show that some improvement in
performance is obtainable with the refining process.

The second, more ambitious group of experiments deals with the
fully automatic classification procedures outlined at the beginning of
this section. In one such study a large variety of graph theoretical
definitions was used to define the term classes, including "strings of
terms", "stars", "cliques", and "clumps", and various threshold and
frequency restrictions were applied to the class generation methods. [19]
In general, it is found that some of the automatic classifications operate
more effectively than unclassified keywords, particularly if "strong"
similarity connections (with a large threshold value) are used, and only
nonfrequent terms are permitted to be classified. A comparison of the
automatic classifications with manual thesauruses was not attempted in
this case.

Another fully automatic term classification experiment was recently
concluded, using procedures very similar to the preceding ones, with a
large experimental collection of 11,500 document abstracts in computer
engineering. [20] A class refining process was implemented in that case,
and many different parameter variations were tried. In the end, only
modest improvements were obtained over a standard word stem matching
process, the author claiming that

"in relation to results yielded by our various (automatic)
associative strategies, it must be concluded that retrieval
by the simple means of comparing keyword stems provides a
very good level of performance." [20, p. 61]

The last term classification experiment is based on a semi-automatic
method for generating the original term vectors used to produce the
term-term similarity matrix. Specifically, a set of properties is manually
generated by asking questions about each term, and properly encoding the
answers.* For each term, the corresponding property vector is then defined
as the set of answers obtained in response to ten or twelve manually
generated questions. When all term vectors are available, one of the
automatic classification procedures may be used to obtain the actual
thesaurus classification. [3, 21]

Such a semi-automatic dictionary was constructed for documents
in computer engineering. Its properties are compared with those of a
manually constructed thesaurus in the summary of Table 3. It is seen that
the semi-automatic thesaurus classes are much less homogeneous (some classes
being very large, and some very small) than the corresponding manual
classes. Furthermore, fewer common words are identified in the
semi-automatic thesaurus.

The retrieval results obtained with the two thesauruses are included
in Fig. 5. It is seen that the semi-automatic thesaurus produces a less
effective performance than the corresponding manually constructed dictionary
over most of the performance range. Only for very high recall is the
effectiveness of both dictionaries approximately equal.

*A typical question might inquire whether a given term in computer science
refers to computer hardware (1), or to computer software (2), or whether the
question is inapplicable to the given term (3); the chosen answer is then
encoded by the response number (n).

Properties                          Manual               Semi-Automatic
                                    (Harris) Thesaurus   (Bench) Thesaurus

Number of Concept Classes           863                  2959
Number of Word (Stem) Entries       2551                 5197
Avg. Number of Words per Class      3                    1.6
Number of Very Small (Single        ?68                  2725
  Word) Classes
Number of Very Large Classes        2
  (32 to 101 Words)
Number of Words Appearing           52
  in Two or More Classes
Proportion of "Common" Words        37.3%
  Compared to Total Words

Table 3

4. Summary 

A number of manual and automatic dictionary construction procedures
are described and evaluated in the present study, including in particular,
automatic methods for the recognition of common words, and automatic or
semi-automatic term grouping methods. It appears that the automatic common
word recognition methodology can usefully be incorporated into existing
text analysis systems; indeed, the effectiveness of the resulting extended
word stem matching process appears equivalent to that obtainable with
standard thesauruses.

The effectiveness of the automatic term grouping algorithms is still
somewhat in doubt. The automatic grouping methods can probably be implemented
more efficiently than the more costly manual thesaurus construction process.
However, no clearly superior automatic thesaurus, using term classes, has
as yet been generated. [22, 23]

For the present time, a combination of manual and automatic thesaurus
methods therefore appears most promising for practical applications, involving
the following steps:

a) automatic common word recognition;

b) manual term classification;

c) automatic refining of the manually produced classes.